pub unsafe extern "C" fn cuTensorMapEncodeIm2col(
    tensorMap: *mut CUtensorMap,
    tensorDataType: CUtensorMapDataType,
    tensorRank: cuuint32_t,
    globalAddress: *mut c_void,
    globalDim: *const cuuint64_t,
    globalStrides: *const cuuint64_t,
    pixelBoxLowerCorner: *const c_int,
    pixelBoxUpperCorner: *const c_int,
    channelsPerPixel: cuuint32_t,
    pixelsPerColumn: cuuint32_t,
    elementStrides: *const cuuint32_t,
    interleave: CUtensorMapInterleave,
    swizzle: CUtensorMapSwizzle,
    l2Promotion: CUtensorMapL2promotion,
    oobFill: CUtensorMapFloatOOBfill,
) -> CUresult
\brief Create a tensor map descriptor object representing im2col memory region
Creates a descriptor for a Tensor Memory Access (TMA) object specified by the parameters describing an im2col memory layout and returns it in \p tensorMap.
Tensor map objects are only supported on devices of compute capability 9.0 or higher. Additionally, a tensor map object is an opaque value, and, as such, should only be accessed through CUDA API calls.
The parameters passed are bound to the following requirements:
-
\p tensorMap address must be aligned to 64 bytes.
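On the Rust side, one way to satisfy the 64-byte alignment requirement is to keep the descriptor in an over-aligned wrapper. The following is a minimal sketch, not part of these bindings: `AlignedTensorMapStorage` and `is_64_byte_aligned` are hypothetical names, and the 16-qword (128-byte) size is an assumption based on how `CUtensorMap` is declared in `cuda.h`.

```rust
/// Hypothetical stand-in for the opaque descriptor storage.
/// Assumption: the descriptor is 16 opaque 64-bit words (128 bytes).
#[repr(C, align(64))]
pub struct AlignedTensorMapStorage {
    pub opaque: [u64; 16],
}

impl AlignedTensorMapStorage {
    /// Creates zero-initialized storage that the encode call could fill in.
    pub fn zeroed() -> Self {
        Self { opaque: [0; 16] }
    }
}

/// Returns true when `ptr` satisfies the documented 64-byte alignment.
pub fn is_64_byte_aligned<T>(ptr: *const T) -> bool {
    (ptr as usize) % 64 == 0
}
```

A pointer to such a value could then be cast to `*mut CUtensorMap` before the call, provided the real type has the same size and layout.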
-
\p tensorDataType has to be an enum from ::CUtensorMapDataType which is defined as:
\code
typedef enum CUtensorMapDataType_enum {
    CU_TENSOR_MAP_DATA_TYPE_UINT8 = 0,       // 1 byte
    CU_TENSOR_MAP_DATA_TYPE_UINT16,          // 2 bytes
    CU_TENSOR_MAP_DATA_TYPE_UINT32,          // 4 bytes
    CU_TENSOR_MAP_DATA_TYPE_INT32,           // 4 bytes
    CU_TENSOR_MAP_DATA_TYPE_UINT64,          // 8 bytes
    CU_TENSOR_MAP_DATA_TYPE_INT64,           // 8 bytes
    CU_TENSOR_MAP_DATA_TYPE_FLOAT16,         // 2 bytes
    CU_TENSOR_MAP_DATA_TYPE_FLOAT32,         // 4 bytes
    CU_TENSOR_MAP_DATA_TYPE_FLOAT64,         // 8 bytes
    CU_TENSOR_MAP_DATA_TYPE_BFLOAT16,        // 2 bytes
    CU_TENSOR_MAP_DATA_TYPE_FLOAT32_FTZ,     // 4 bytes
    CU_TENSOR_MAP_DATA_TYPE_TFLOAT32,        // 4 bytes
    CU_TENSOR_MAP_DATA_TYPE_TFLOAT32_FTZ     // 4 bytes
} CUtensorMapDataType;
\endcode
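The byte sizes noted in the enum comments are what the stride formulas rely on. A small sketch of a lookup helper follows; `TensorDataType` is a hypothetical Rust mirror of the C enum used only for illustration (the generated bindings expose their own `CUtensorMapDataType`), and `element_size_in_bytes` is an assumed helper name.

```rust
/// Hypothetical mirror of `CUtensorMapDataType`, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum TensorDataType {
    Uint8,
    Uint16,
    Uint32,
    Int32,
    Uint64,
    Int64,
    Float16,
    Float32,
    Float64,
    Bfloat16,
    Float32Ftz,
    Tfloat32,
    Tfloat32Ftz,
}

/// Returns the element size in bytes, following the comments in the C enum.
pub fn element_size_in_bytes(t: TensorDataType) -> u64 {
    use TensorDataType::*;
    match t {
        Uint8 => 1,
        Uint16 | Float16 | Bfloat16 => 2,
        Uint32 | Int32 | Float32 | Float32Ftz | Tfloat32 | Tfloat32Ftz => 4,
        Uint64 | Int64 | Float64 => 8,
    }
}
```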
-
\p tensorRank, which specifies the number of tensor dimensions, must be 3, 4, or 5.
-
\p globalAddress, which specifies the starting address of the memory region described, must be 32 byte aligned when \p interleave is ::CU_TENSOR_MAP_INTERLEAVE_32B and 16 byte aligned otherwise.
-
\p globalDim array, which specifies tensor size of each of the \p tensorRank dimensions, must be non-zero and less than or equal to 2^32.
-
\p globalStrides array, which specifies tensor stride of each of the lower \p tensorRank - 1 dimensions in bytes, must be a multiple of 16 and less than 2^40. Additionally, the stride must be a multiple of 32 when \p interleave is ::CU_TENSOR_MAP_INTERLEAVE_32B. The stride of each subsequent dimension includes the stride of the previous dimension:
\code
globalStrides[0] = globalDim[0] * elementSizeInBytes(tensorDataType) + padding[0];
for (i = 1; i < tensorRank - 1; i++) {
    globalStrides[i] = globalStrides[i - 1] * (globalDim[i] + padding[i]);
    assert(globalStrides[i] >= globalDim[i]);
}
\endcode
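For a densely packed tensor, the stride formula above reduces to a running product. The sketch below assumes zero padding in every dimension; `packed_global_strides` and `stride_is_valid` are hypothetical helper names, and the validity check covers only the multiple-of-16/32 and 2^40 constraints stated here.

```rust
/// Computes byte strides for the lower `rank - 1` dimensions of a tensor,
/// assuming zero padding (a simplification of the documented formula).
pub fn packed_global_strides(global_dim: &[u64], elem_size: u64) -> Vec<u64> {
    let count = global_dim.len().saturating_sub(1);
    let mut strides = Vec::with_capacity(count);
    let mut stride = elem_size;
    for &dim in &global_dim[..count] {
        // Each stride is the previous stride times the current extent.
        stride *= dim;
        strides.push(stride);
    }
    strides
}

/// Checks one stride against the documented constraints.
pub fn stride_is_valid(stride: u64, interleave_32b: bool) -> bool {
    let multiple = if interleave_32b { 32 } else { 16 };
    stride % multiple == 0 && stride < (1u64 << 40)
}
```

For example, a rank-4 tensor with extents `[64, 56, 56, 8]` and 2-byte elements yields three strides, one for each dimension except the outermost.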
-
\p pixelBoxLowerCorner array specifies the coordinate offsets {D, H, W} of the bounding box from top/left/front corner. The number of offsets and their precision depend on the tensor dimensionality:
- When \p tensorRank is 3, one signed offset within range [-32768, 32767] is supported.
- When \p tensorRank is 4, two signed offsets each within range [-128, 127] are supported.
- When \p tensorRank is 5, three signed offsets each within range [-16, 15] are supported.
-
\p pixelBoxUpperCorner array specifies the coordinate offsets {D, H, W} of the bounding box from bottom/right/back corner. The number of offsets and their precision depend on the tensor dimensionality:
- When \p tensorRank is 3, one signed offset within range [-32768, 32767] is supported.
- When \p tensorRank is 4, two signed offsets each within range [-128, 127] are supported.
- When \p tensorRank is 5, three signed offsets each within range [-16, 15] are supported.
The bounding box specified by \p pixelBoxLowerCorner and \p pixelBoxUpperCorner must have non-zero area.
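The rank-dependent offset counts and ranges for both corner arrays can be checked before calling into the driver. The following is a sketch with hypothetical helper names (`corner_offset_range`, `corner_offsets_are_valid`); it covers only the count and range rules listed above, not the non-zero-area requirement.

```rust
/// Returns (number of offsets, minimum, maximum) for a given tensor rank,
/// or `None` when the rank is not 3, 4, or 5.
pub fn corner_offset_range(tensor_rank: u32) -> Option<(usize, i32, i32)> {
    match tensor_rank {
        3 => Some((1, -32768, 32767)),
        4 => Some((2, -128, 127)),
        5 => Some((3, -16, 15)),
        _ => None,
    }
}

/// Checks that a corner array has the expected length and that every
/// offset lies within the range allowed for the rank.
pub fn corner_offsets_are_valid(tensor_rank: u32, offsets: &[i32]) -> bool {
    match corner_offset_range(tensor_rank) {
        Some((count, lo, hi)) => {
            offsets.len() == count && offsets.iter().all(|&o| (lo..=hi).contains(&o))
        }
        None => false,
    }
}
```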
-
\p channelsPerPixel, which specifies the number of elements which must be accessed along C dimension, must be less than or equal to 256.
-
\p pixelsPerColumn, which specifies the number of elements that must be accessed along the {N, D, H, W} dimensions, must be less than or equal to 1024.
-
\p elementStrides array, which specifies the iteration step along each of the \p tensorRank dimensions, must be non-zero and less than or equal to 8. Note that when \p interleave is ::CU_TENSOR_MAP_INTERLEAVE_NONE, the first element of this array is ignored since TMA doesn’t support the stride for dimension zero. When all elements of the \p elementStrides array are one, \p boxDim specifies the number of elements to load. However, if \p elementStrides[i] is not equal to one for some \p i, then TMA loads ceil( \p boxDim[i] / \p elementStrides[i]) number of elements along i-th dimension. To load N elements along i-th dimension, \p boxDim[i] must be set to N * \p elementStrides[i].
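The relationship between box extent, element stride, and the number of elements actually loaded is simple ceiling arithmetic. A minimal sketch, with `elements_loaded` and `box_dim_for` as hypothetical helper names:

```rust
/// Number of elements loaded along one dimension:
/// ceil(box_dim / element_stride).
pub fn elements_loaded(box_dim: u32, element_stride: u32) -> u32 {
    assert!(element_stride != 0, "element stride must be non-zero");
    (box_dim + element_stride - 1) / element_stride
}

/// Box extent needed to load `n` elements with the given stride:
/// n * element_stride.
pub fn box_dim_for(n: u32, element_stride: u32) -> u32 {
    n * element_stride
}
```

With a stride of 2, a box extent of 8 loads 4 elements, and so does a box extent of 7, since the quotient is rounded up.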
-
\p interleave specifies the interleaved layout of type ::CUtensorMapInterleave, which is defined as:
\code
typedef enum CUtensorMapInterleave_enum {
    CU_TENSOR_MAP_INTERLEAVE_NONE = 0,
    CU_TENSOR_MAP_INTERLEAVE_16B,
    CU_TENSOR_MAP_INTERLEAVE_32B
} CUtensorMapInterleave;
\endcode
TMA supports interleaved layouts like NC/8HWC8, where C8 occupies 16 bytes in memory assuming 2 bytes per channel, or NC/16HWC16, where C16 occupies 32 bytes. When \p interleave is ::CU_TENSOR_MAP_INTERLEAVE_NONE and \p swizzle is not ::CU_TENSOR_MAP_SWIZZLE_NONE, the bounding box inner dimension (computed as \p boxDim[0] multiplied by element size derived from \p tensorDataType) must be less than or equal to the swizzle size.
- CU_TENSOR_MAP_SWIZZLE_32B implies the bounding box inner dimension will be <= 32.
- CU_TENSOR_MAP_SWIZZLE_64B implies the bounding box inner dimension will be <= 64.
- CU_TENSOR_MAP_SWIZZLE_128B implies the bounding box inner dimension will be <= 128.
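The swizzle-size limit on the inner dimension can be expressed as a byte comparison. The sketch below applies to the non-interleaved case described above; `SwizzleMode`, `swizzle_span_bytes`, and `inner_dimension_fits` are hypothetical names rather than items from the bindings.

```rust
/// Hypothetical mirror of `CUtensorMapSwizzle`, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SwizzleMode {
    Disabled,
    B32,
    B64,
    B128,
}

/// Maximum inner-dimension size in bytes, or `None` when swizzling is off.
pub fn swizzle_span_bytes(mode: SwizzleMode) -> Option<u64> {
    match mode {
        SwizzleMode::Disabled => None,
        SwizzleMode::B32 => Some(32),
        SwizzleMode::B64 => Some(64),
        SwizzleMode::B128 => Some(128),
    }
}

/// Checks inner elements * element size against the swizzle size.
/// Only meaningful when the interleave mode is NONE.
pub fn inner_dimension_fits(inner_elems: u64, elem_size: u64, mode: SwizzleMode) -> bool {
    match swizzle_span_bytes(mode) {
        Some(limit) => inner_elems * elem_size <= limit,
        None => true,
    }
}
```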
-
\p swizzle, which specifies the shared memory bank swizzling pattern, has to be of type ::CUtensorMapSwizzle which is defined as:
\code
typedef enum CUtensorMapSwizzle_enum {
    CU_TENSOR_MAP_SWIZZLE_NONE = 0,
    CU_TENSOR_MAP_SWIZZLE_32B,
    CU_TENSOR_MAP_SWIZZLE_64B,
    CU_TENSOR_MAP_SWIZZLE_128B
} CUtensorMapSwizzle;
\endcode
Data are organized in a specific order in global memory; however, this may not match the order in which the application accesses data in shared memory. This difference in data organization may cause bank conflicts when shared memory is accessed. In order to avoid this problem, data can be loaded to shared memory with shuffling across shared memory banks. When \p interleave is ::CU_TENSOR_MAP_INTERLEAVE_32B, \p swizzle must be ::CU_TENSOR_MAP_SWIZZLE_32B. Other interleave modes can have any swizzling pattern.
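The interleave/swizzle compatibility rule stated above is a single conditional. A sketch, using hypothetical mirror enums `InterleaveKind` and `SwizzleKind` and an assumed helper name `swizzle_is_allowed`:

```rust
/// Hypothetical mirror of `CUtensorMapInterleave`, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum InterleaveKind {
    NoInterleave,
    B16,
    B32,
}

/// Hypothetical mirror of `CUtensorMapSwizzle`, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SwizzleKind {
    NoSwizzle,
    B32,
    B64,
    B128,
}

/// 32-byte interleave requires 32-byte swizzle; other interleave modes
/// accept any swizzling pattern.
pub fn swizzle_is_allowed(interleave: InterleaveKind, swizzle: SwizzleKind) -> bool {
    match interleave {
        InterleaveKind::B32 => swizzle == SwizzleKind::B32,
        _ => true,
    }
}
```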
-
\p l2Promotion specifies L2 fetch size which indicates the byte granularity at which L2 requests are filled from DRAM. It must be of type ::CUtensorMapL2promotion, which is defined as:
\code
typedef enum CUtensorMapL2promotion_enum {
    CU_TENSOR_MAP_L2_PROMOTION_NONE = 0,
    CU_TENSOR_MAP_L2_PROMOTION_L2_64B,
    CU_TENSOR_MAP_L2_PROMOTION_L2_128B,
    CU_TENSOR_MAP_L2_PROMOTION_L2_256B
} CUtensorMapL2promotion;
\endcode
-
\p oobFill, which indicates whether zero or a special NaN constant should be used to fill out-of-bound elements, must be of type ::CUtensorMapFloatOOBfill which is defined as:
\code
typedef enum CUtensorMapFloatOOBfill_enum {
    CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE = 0,
    CU_TENSOR_MAP_FLOAT_OOB_FILL_NAN_REQUEST_ZERO_FMA
} CUtensorMapFloatOOBfill;
\endcode
Note that ::CU_TENSOR_MAP_FLOAT_OOB_FILL_NAN_REQUEST_ZERO_FMA can only be used when \p tensorDataType represents a floating-point data type.
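The floating-point restriction on NaN fill can be checked up front. The sketch below uses a hypothetical `ElemKind` enum and helper names; treating the FLOAT16, FLOAT32, FLOAT64, BFLOAT16, and TFLOAT32 variants (including the FTZ ones) as the floating-point set is an assumption based on the enumerator names, not something this page states.

```rust
/// Hypothetical mirror of `CUtensorMapDataType`, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ElemKind {
    Uint8,
    Uint16,
    Uint32,
    Int32,
    Uint64,
    Int64,
    Float16,
    Float32,
    Float64,
    Bfloat16,
    Float32Ftz,
    Tfloat32,
    Tfloat32Ftz,
}

/// Assumption: these variants are the floating-point data types.
pub fn is_floating_point(kind: ElemKind) -> bool {
    matches!(
        kind,
        ElemKind::Float16
            | ElemKind::Float32
            | ElemKind::Float64
            | ElemKind::Bfloat16
            | ElemKind::Float32Ftz
            | ElemKind::Tfloat32
            | ElemKind::Tfloat32Ftz
    )
}

/// NaN fill is only valid for floating-point types; no fill is always valid.
pub fn oob_fill_is_valid(kind: ElemKind, nan_fill: bool) -> bool {
    !nan_fill || is_floating_point(kind)
}
```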
\param tensorMap - Tensor map object to create
\param tensorDataType - Tensor data type
\param tensorRank - Dimensionality of tensor; must be at least 3
\param globalAddress - Starting address of memory region described by tensor
\param globalDim - Array containing tensor size (number of elements) along each of the \p tensorRank dimensions
\param globalStrides - Array containing stride size (in bytes) along each of the \p tensorRank - 1 dimensions
\param pixelBoxLowerCorner - Array containing DHW dimensions of lower box corner
\param pixelBoxUpperCorner - Array containing DHW dimensions of upper box corner
\param channelsPerPixel - Number of channels per pixel
\param pixelsPerColumn - Number of pixels per column
\param elementStrides - Array containing traversal stride in each of the \p tensorRank dimensions
\param interleave - Type of interleaved layout the tensor addresses
\param swizzle - Bank swizzling pattern inside shared memory
\param l2Promotion - L2 promotion size
\param oobFill - Indicate whether zero or special NaN constant will be used to fill out-of-bound elements
\return ::CUDA_SUCCESS, ::CUDA_ERROR_DEINITIALIZED, ::CUDA_ERROR_NOT_INITIALIZED, ::CUDA_ERROR_INVALID_CONTEXT, ::CUDA_ERROR_INVALID_VALUE
\sa ::cuTensorMapEncodeTiled, ::cuTensorMapReplaceAddress