# OptimizedMatmul Example Readme

catlass is the CANN operator template library, providing template samples for high-performance matrix multiplication and related fused operators on NPUs. Project address: https://gitcode.com/cann/catlass

## Code Organization

```
├── 06_optimized_matmul
│   ├── CMakeLists.txt # CMake build file
│   ├── README.md
│   └── optimized_matmul.cpp # Main file
```

## Function

This example demonstrates optimized matrix multiplication. Compared to the `00_basic_matmul` example, this implementation replaces the dispatch policy with `MmadAtlasA2Preload` and adds padding preprocessing for the input matrices to improve data transfer performance.

## Example

- After obtaining the code, compile the operator executable file. For details, see Template Library Quick Start.
- Execute the operator.

```bash
# Compile a specified test case.
bash scripts/build.sh 06_optimized_matmul
cd output/bin
# Executable file name | Matrix M-axis | N-axis | K-axis | Device ID
# The device ID is optional. The default value is 0.
./06_optimized_matmul 256 512 1024 0
```

If the following result is displayed, precision verification is successful.

```
Compare success.
```

## Remarks

In this example, the default padding action uses `PADDING_NZ`. You can switch this to `PADDING_BLOCK_ND` to compare its performance.

### PADDING_NZ

The code configuration is as follows:

```cpp
constexpr PaddingTag paddingTagA = (std::is_same_v<LayoutA, layout::zN> || std::is_same_v<LayoutA, layout::nZ>)
    ? PaddingTag::NO_PADDING
    : PaddingTag::PADDING_NZ;
constexpr PaddingTag paddingTagB = (std::is_same_v<LayoutB, layout::zN> || std::is_same_v<LayoutB, layout::nZ>)
    ? PaddingTag::NO_PADDING
    : PaddingTag::PADDING_NZ;
```

Under the `PADDING_NZ` policy, `COMPUTE_LENGTH` corresponds to 48 KB of UB space, expressed as an element count:

```cpp
static const uint32_t COMPUTE_LENGTH_A = 48 * 1024 / sizeof(ElementA);
static const uint32_t COMPUTE_LENGTH_B = 48 * 1024 / sizeof(ElementB);
```

### PADDING_BLOCK_ND

The modifications required to enable `PADDING_BLOCK_ND` are shown below.
When the input matrix is not in NZ format, this policy aligns and pads the matrix according to `L1TileShape`:

```diff
 constexpr PaddingTag paddingTagA = (std::is_same_v<LayoutA, layout::zN> || std::is_same_v<LayoutA, layout::nZ>)
     ? PaddingTag::NO_PADDING
-    : PaddingTag::PADDING_NZ;
+    : PaddingTag::PADDING_BLOCK_ND;
 constexpr PaddingTag paddingTagB = (std::is_same_v<LayoutB, layout::zN> || std::is_same_v<LayoutB, layout::nZ>)
     ? PaddingTag::NO_PADDING
-    : PaddingTag::PADDING_NZ;
+    : PaddingTag::PADDING_BLOCK_ND;
```

Under the `PADDING_BLOCK_ND` policy, the UB space behind `COMPUTE_LENGTH` doubles to 96 KB:

```diff
-static const uint32_t COMPUTE_LENGTH_A = 48 * 1024 / sizeof(ElementA);
-static const uint32_t COMPUTE_LENGTH_B = 48 * 1024 / sizeof(ElementB);
+static const uint32_t COMPUTE_LENGTH_A = 96 * 1024 / sizeof(ElementA);
+static const uint32_t COMPUTE_LENGTH_B = 96 * 1024 / sizeof(ElementB);
```
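To make the two settings from the Remarks easier to reason about, the following standalone sketch reproduces the compile-time tag selection and the `COMPUTE_LENGTH` arithmetic outside the library. It is an illustration under stated assumptions, not the example's actual code: the `layout` tags and the `PaddingTag` enum are stand-ins for the catlass definitions, and `SelectPaddingTag`, `ComputeLength`, and `RoundUp` are hypothetical helper names. `RoundUp` also assumes that "aligning to `L1TileShape`" means rounding a dimension up to a multiple of a tile extent, and the extent 128 is chosen only for illustration.

```cpp
#include <cstdint>
#include <type_traits>

// Stand-ins for the library's layout tags (assumption: the real ones live in catlass headers).
namespace layout {
struct RowMajor {};
struct zN {};
struct nZ {};
}  // namespace layout

// Stand-in for the library's PaddingTag enum.
enum class PaddingTag { NO_PADDING, PADDING_NZ, PADDING_BLOCK_ND };

// Hypothetical helper: inputs already in an NZ-style layout need no padding,
// any other layout gets the requested padding tag.
template <class Layout, PaddingTag Requested>
constexpr PaddingTag SelectPaddingTag()
{
    return (std::is_same_v<Layout, layout::zN> || std::is_same_v<Layout, layout::nZ>)
        ? PaddingTag::NO_PADDING
        : Requested;
}

// Hypothetical helper: convert a UB budget in KB into an element count,
// mirroring the COMPUTE_LENGTH_A / COMPUTE_LENGTH_B expressions.
template <class Element>
constexpr uint32_t ComputeLength(uint32_t kilobytes)
{
    return static_cast<uint32_t>(kilobytes * 1024u / sizeof(Element));
}

// Hypothetical helper: round a dimension up to the next multiple of a tile extent.
constexpr uint32_t RoundUp(uint32_t value, uint32_t tile)
{
    return (value + tile - 1) / tile * tile;
}

// A non-NZ layout takes the requested tag; an NZ-style layout is left unpadded.
static_assert(SelectPaddingTag<layout::RowMajor, PaddingTag::PADDING_BLOCK_ND>()
              == PaddingTag::PADDING_BLOCK_ND);
static_assert(SelectPaddingTag<layout::zN, PaddingTag::PADDING_BLOCK_ND>()
              == PaddingTag::NO_PADDING);

// With a 2-byte element (uint16_t used here as a stand-in for a half-precision type),
// 48 KB holds 24576 elements and 96 KB holds 49152.
static_assert(ComputeLength<uint16_t>(48) == 24576);
static_assert(ComputeLength<uint16_t>(96) == 49152);

// With an illustrative tile extent of 128, a dimension of 1000 pads to 1024.
static_assert(RoundUp(1000, 128) == 1024);
```

Because every check is a `static_assert`, compiling the file with a C++17 compiler is enough to confirm the arithmetic. The sketch says nothing about which policy is faster for a given shape; that still has to be measured by running the example.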