CANN/catlass GEMM Kernel Development Explained

GEMM Kernel Code Explained

catlass is CANN's operator template library. It provides template samples for high-performance matrix multiplication and related fused operators on NPUs. Project address: https://gitcode.com/cann/catlass

1. Kernel Code Structure Overview

The GEMM Kernel in the CATLASS template library adopts a highly modular design: it assembles different components through template parameters to implement various matrix multiplication operations. This document uses BasicMatmul as an example to break down the core structure and key components of the Kernel code.

2. Template Assembly Mechanism

All GEMM Kernels are defined as class templates whose template parameters supply the functional components. Take BasicMatmul as an example:

    template <
        class BlockMmad_,
        class BlockEpilogue_,
        class BlockScheduler_
    >
    class BasicMatmul {
    public:
        using BlockMmad = BlockMmad_;
        using ArchTag = typename BlockMmad::ArchTag;
        using L1TileShape = typename BlockMmad::L1TileShape;
        using ElementA = typename BlockMmad::ElementA;
        using LayoutA = typename BlockMmad::LayoutA;
        using ElementB = typename BlockMmad::ElementB;
        using LayoutB = typename BlockMmad::LayoutB;
        using ElementC = typename BlockMmad::ElementC;
        using LayoutC = typename BlockMmad::LayoutC;
        using ElementAccumulator = typename BlockMmad::ElementAccumulator;
        using BlockScheduler = BlockScheduler_;
        // ...
    };

2.1 Core Template Parameters

| Parameter       | Description                                                                              |
|-----------------|------------------------------------------------------------------------------------------|
| BlockMmad_      | The core computation component responsible for matrix multiplication                     |
| BlockEpilogue_  | Responsible for epilogues of the computation results (e.g., activation functions, quantization) |
| BlockScheduler_ | Responsible for scheduling and distributing computational tasks to different compute cores |

2.2 Type Export

The types exported through the template parameters form the Kernel's core type system, which includes:

- Architecture tag (ArchTag)
- L1 cache tile shape (L1TileShape)
- Data types (ElementA/B/C/Accumulator)
- Data layouts (LayoutA/B/C)

3. Parameter Passing Mechanism

The Kernel uses a two-layer parameter structure: Arguments (user interface layer) and Params (kernel execution layer).

3.1 Arguments

Arguments is the parameter structure used directly by users. It contains only the basic input and output information:

    struct Arguments {
        GemmCoord problemShape;
        GM_ADDR ptrA;
        GM_ADDR ptrB;
        GM_ADDR ptrC;
    };

3.2 Params

Params is the parameter structure used during actual kernel execution. In addition to the fields of Arguments, it carries the layout of each matrix:

    struct Params {
        // Data members
        GemmCoord problemShape;
        GM_ADDR ptrA;
        LayoutA layoutA;
        GM_ADDR ptrB;
        LayoutB layoutB;
        GM_ADDR ptrC;
        LayoutC layoutC;

        // Methods
        CATLASS_HOST_DEVICE
        Params() {}

        CATLASS_HOST_DEVICE
        Params(GemmCoord const &problemShape_,
               GM_ADDR ptrA_, LayoutA layoutA_,
               GM_ADDR ptrB_, LayoutB layoutB_,
               GM_ADDR ptrC_, LayoutC layoutC_)
            : problemShape(problemShape_),
              ptrA(ptrA_), layoutA(layoutA_),
              ptrB(ptrB_), layoutB(layoutB_),
              ptrC(ptrC_), layoutC(layoutC_) {}
    };

3.3 Parameter Conversion

The ToUnderlyingArguments function converts Arguments to Params. It derives the three layouts from the problem shape: A is m x k, B is k x n, and C is m x n.

    static Params ToUnderlyingArguments(const Arguments &args, uint8_t *workspace)
    {
        LayoutA layoutA{args.problemShape.m(), args.problemShape.k()};
        LayoutB layoutB{args.problemShape.k(), args.problemShape.n()};
        LayoutC layoutC{args.problemShape.m(), args.problemShape.n()};
        Params params{args.problemShape,
                      args.ptrA, layoutA,
                      args.ptrB, layoutB,
                      args.ptrC, layoutC};
        return params;
    }

4. Key Functions

4.1 CanImplement

Checks whether the current hardware and environment support the implementation of this Kernel. For BasicMatmul it always returns true:

    static bool CanImplement(const Arguments &args)
    {
        return true;
    }

4.2 GetWorkspaceSize

Gets the workspace size required for Kernel execution. BasicMatmul needs no workspace, so it returns 0:

    static size_t GetWorkspaceSize(const Arguments &args)
    {
        return 0;
    }

4.3 operator()

This is the Kernel's core execution function.
It supports different core types (such as AIC and AIV) through template specialization:

    template <int32_t CORE_TYPE = g_coreType>
    CATLASS_DEVICE
    void operator()(Params const &params);

    /// Executes one Matmul
    template <>
    CATLASS_DEVICE
    void operator()<AscendC::AIC>(Params const &params)
    {
        BlockScheduler matmulBlockScheduler(params.problemShape, MakeCoord(L1TileShape::M, L1TileShape::N));
        uint32_t coreLoops = matmulBlockScheduler.GetCoreLoops();

        Arch::Resource<ArchTag> resource;
        BlockMmad blockMmad(resource);

        // Represent the full gm
        AscendC::GlobalTensor<ElementA> gmA;
        gmA.SetGlobalBuffer((__gm__ ElementA *)params.ptrA);
        AscendC::GlobalTensor<ElementB> gmB;
        gmB.SetGlobalBuffer((__gm__ ElementB *)params.ptrB);
        AscendC::GlobalTensor<ElementC> gmC;
        gmC.SetGlobalBuffer((__gm__ ElementC *)params.ptrC);

        for (uint32_t loopIdx = AscendC::GetBlockIdx(); loopIdx < coreLoops; loopIdx += AscendC::GetBlockNum()) {
            // Compute block location
            GemmCoord blockCoord = matmulBlockScheduler.GetBlockCoord(loopIdx);
            GemmCoord actualBlockShape = matmulBlockScheduler.GetActualBlockShape(blockCoord);

            // Compute initial location in logical coordinates
            MatrixCoord offsetA{blockCoord.m() * L1TileShape::M, blockCoord.k() * L1TileShape::K};
            MatrixCoord offsetB{blockCoord.k() * L1TileShape::K, blockCoord.n() * L1TileShape::N};
            MatrixCoord offsetC{blockCoord.m() * L1TileShape::M, blockCoord.n() * L1TileShape::N};
            int64_t gmOffsetA = params.layoutA.GetOffset(offsetA);
            int64_t gmOffsetB = params.layoutB.GetOffset(offsetB);
            int64_t gmOffsetC = params.layoutC.GetOffset(offsetC);

            // Compute block-scoped matrix multiply-add
            blockMmad(gmA[gmOffsetA], params.layoutA,
                      gmB[gmOffsetB], params.layoutB,
                      gmC[gmOffsetC], params.layoutC,
                      actualBlockShape);
        }

        AscendC::PipeBarrier<PIPE_ALL>();
    }

5. Execution Flow Analysis

The Kernel's execution flow divides into the following steps:

5.1 Initializing the Scheduler

    BlockScheduler matmulBlockScheduler(params.problemShape, MakeCoord(L1TileShape::M, L1TileShape::N));
    uint32_t coreLoops = matmulBlockScheduler.GetCoreLoops();

5.2 Initializing Resources and Compute Components

    Arch::Resource<ArchTag> resource;
    BlockMmad blockMmad(resource);

5.3 Setting Global Memory Tensors

    AscendC::GlobalTensor<ElementA> gmA;
    gmA.SetGlobalBuffer((__gm__ ElementA *)params.ptrA);
    // Set gmB and gmC...

5.4 Looping Through Each Compute Block

Each core starts at its own block index and advances by the total number of cores, so the blocks are distributed round-robin across cores:

    for (uint32_t loopIdx = AscendC::GetBlockIdx(); loopIdx < coreLoops; loopIdx += AscendC::GetBlockNum()) {
        // 1. Compute block coordinates.
        GemmCoord blockCoord = matmulBlockScheduler.GetBlockCoord(loopIdx);
        GemmCoord actualBlockShape = matmulBlockScheduler.GetActualBlockShape(blockCoord);

        // 2. Compute memory offsets.
        MatrixCoord offsetA{blockCoord.m() * L1TileShape::M, blockCoord.k() * L1TileShape::K};
        // Compute offsetB and offsetC...
        int64_t gmOffsetA = params.layoutA.GetOffset(offsetA);
        // Compute gmOffsetB and gmOffsetC...

        // 3. Execute block-level matrix multiplication.
        blockMmad(gmA[gmOffsetA], params.layoutA,
                  gmB[gmOffsetB], params.layoutB,
                  gmC[gmOffsetC], params.layoutC,
                  actualBlockShape);
    }

5.5 Synchronization

    AscendC::PipeBarrier<PIPE_ALL>();

6. Extensions and Differences Among Kernels

Comparing BasicMatmul, BatchedMatmul, QuantMatmul, and OptimizedMatmul shows what they share in the base structure and where they differ.

6.1 BatchedMatmul Extension

BatchedMatmul adds batch processing support to BasicMatmul:

    struct Params {
        // Data members
        uint32_t batchCount;   // Added batch count
        GemmCoord problemShape;
        GM_ADDR ptrA;
        LayoutA layoutA;
        int64_t strideA;       // Added batch stride for matrix A
        GM_ADDR ptrB;
        LayoutB layoutB;
        int64_t strideB;       // Added batch stride for matrix B
        GM_ADDR ptrC;
        LayoutC layoutC;
        int64_t strideC;       // Added batch stride for matrix C
        // ...
    };

6.2 QuantMatmul Extension

QuantMatmul adds quantization-related parameters and processing:

    struct Params {
        // Data members
        GemmCoord problemShape;
        __gm__ ElementA *ptrA;
        LayoutA layoutA;
        __gm__ ElementB *ptrB;
        LayoutB layoutB;
        __gm__ ElementScale *ptrScale;                   // Added scale parameters
        LayoutScale layoutScale;
        __gm__ ElementPerTokenScale *ptrPerTokenScale;   // Added per-token scale parameters
        LayoutPerTokenScale layoutPerTokenScale;
        __gm__ ElementD *ptrD;                           // Added output matrix D
        LayoutD layoutD;
        GM_ADDR ptrWorkspace;                            // Added workspace
        // ...
    };

6.3 OptimizedMatmul Extension

OptimizedMatmul adds prologue processing and a more complex parameter structure:

    template <
        class PrologueA,        // Added prologue for matrix A
        class PrologueB,        // Added prologue for matrix B
        class BlockMmad_,
        class BlockEpilogue_,
        class BlockScheduler_
    >
    class OptimizedMatmul {
        // ...
        template <bool IsPaddingA = true, bool IsPaddingB = true>
        struct KernelParams : public ParamsBase {
            // Added padding-related parameters
            GM_ADDR ptrWA;
            LayoutWA layoutWA;
            GM_ADDR ptrWB;
            LayoutWB layoutWB;
            // ...
        };
        // ...
    };

7. Summary

The CATLASS GEMM Kernel adopts a highly modular, template-based design with the following characteristics:

- Template assembly: functional components are combined through template parameters, enabling code reuse and function extension.
- Layered parameters: Arguments and Params separate the user interface from kernel execution parameters.
- Unified execution process: all Kernels follow a similar execution flow of initialization, scheduling, computation, and synchronization.
- Extensibility: by extending the base structure, developers can implement advanced features such as batch processing, quantization, and optimization.

This design allows the CATLASS template library to support a wide range of GEMM operations while keeping the code maintainable and extensible.

Author's statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.
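The three ideas summarized above (template assembly, the Arguments/Params split, and the round-robin block loop) can be tried out on a host CPU without any NPU toolchain. The following is a minimal sketch in plain C++, not CATLASS code: ToyShape, ToyBlockMmad, ToyScheduler, ToyMatmul, and RunToyMatmul are hypothetical names invented for illustration, the matrices are assumed to be row-major float arrays, and the "cores" are emulated by a sequential loop over block indices.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side stand-ins. None of these are CATLASS or AscendC types.
struct ToyShape {
    uint32_t m, n, k;
};

// Plays the role of BlockMmad_: computes one output tile of C = A * B (row-major).
template <uint32_t TILE_M, uint32_t TILE_N>
struct ToyBlockMmad {
    static constexpr uint32_t M = TILE_M;
    static constexpr uint32_t N = TILE_N;

    void operator()(const float *a, const float *b, float *c, ToyShape shape,
                    uint32_t rowBegin, uint32_t colBegin,
                    uint32_t rows, uint32_t cols) const
    {
        for (uint32_t i = rowBegin; i < rowBegin + rows; ++i) {
            for (uint32_t j = colBegin; j < colBegin + cols; ++j) {
                float acc = 0.0f;
                for (uint32_t p = 0; p < shape.k; ++p) {
                    acc += a[i * shape.k + p] * b[p * shape.n + j];
                }
                c[i * shape.n + j] = acc;
            }
        }
    }
};

// Plays the role of BlockScheduler_: counts the output tiles of the problem.
struct ToyScheduler {
    ToyShape shape;
    uint32_t tileM, tileN;

    uint32_t TilesN() const { return (shape.n + tileN - 1) / tileN; }
    uint32_t GetCoreLoops() const { return ((shape.m + tileM - 1) / tileM) * TilesN(); }
};

// The kernel is assembled from its components through template parameters.
template <class BlockMmad_, class BlockScheduler_>
struct ToyMatmul {
    struct Arguments { ToyShape problemShape; const float *ptrA; const float *ptrB; float *ptrC; };
    struct Params { ToyShape problemShape; const float *ptrA; const float *ptrB; float *ptrC; };

    // User-facing Arguments are converted to execution-time Params.
    static Params ToUnderlyingArguments(const Arguments &args)
    {
        return Params{args.problemShape, args.ptrA, args.ptrB, args.ptrC};
    }

    // Emulates the work of core `blockIdx` out of `blockNum` cores.
    void operator()(const Params &params, uint32_t blockIdx, uint32_t blockNum) const
    {
        const uint32_t tileM = BlockMmad_::M;
        const uint32_t tileN = BlockMmad_::N;
        BlockScheduler_ scheduler{params.problemShape, tileM, tileN};
        BlockMmad_ blockMmad;
        // Round-robin: start at the core's own index, step by the core count.
        for (uint32_t loopIdx = blockIdx; loopIdx < scheduler.GetCoreLoops(); loopIdx += blockNum) {
            uint32_t rowBegin = (loopIdx / scheduler.TilesN()) * tileM;
            uint32_t colBegin = (loopIdx % scheduler.TilesN()) * tileN;
            // Edge tiles may be smaller than the nominal tile shape.
            uint32_t rows = std::min(tileM, params.problemShape.m - rowBegin);
            uint32_t cols = std::min(tileN, params.problemShape.n - colBegin);
            blockMmad(params.ptrA, params.ptrB, params.ptrC, params.problemShape,
                      rowBegin, colBegin, rows, cols);
        }
    }
};

// Runs the toy kernel with 2x2 tiles, visiting the emulated cores one after another.
std::vector<float> RunToyMatmul(ToyShape shape, const std::vector<float> &a,
                                const std::vector<float> &b, uint32_t blockNum)
{
    assert(a.size() == static_cast<std::size_t>(shape.m) * shape.k);
    assert(b.size() == static_cast<std::size_t>(shape.k) * shape.n);
    using Kernel = ToyMatmul<ToyBlockMmad<2, 2>, ToyScheduler>;
    std::vector<float> c(static_cast<std::size_t>(shape.m) * shape.n, 0.0f);
    Kernel::Arguments args{shape, a.data(), b.data(), c.data()};
    Kernel::Params params = Kernel::ToUnderlyingArguments(args);
    Kernel kernel;
    for (uint32_t blockIdx = 0; blockIdx < blockNum; ++blockIdx) {
        kernel(params, blockIdx, blockNum);
    }
    return c;
}
```

Calling RunToyMatmul(ToyShape{3, 3, 2}, a, b, 3) on a 3 x 2 matrix A and a 2 x 3 matrix B splits the 3 x 3 output into four tiles, two of them narrower and two of them shorter than the nominal 2 x 2 tile, and hands them to three emulated cores. Swapping ToyBlockMmad<2, 2> for a different tile size changes the tiling without touching the kernel body, which is the point of assembling the kernel from template parameters.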
