Single-Node Co-located (混部) Deployment of Qwen3.5-397B on Atlas 800I A3 with vLLM-Ascend
Author: 昇腾实战派
Knowledge map: https://blog.csdn.net/Lumos_Lovegood/article/details/161601003

Background

This document describes a single-node co-located (混部) deployment of the Qwen3.5-397B model on Atlas 800I A3 with vLLM-Ascend. It covers the supported features, the feature configuration, the environment information, and typical performance test cases.

Basic information

| Software version | Device information | Topology | Total cards | Data format |
|---|---|---|---|---|
| 0.18.0 | NPU: Atlas 800I A3-560T, HBM 128G; CPU: Kunpeng 920 (80 cores, 2900 MHz); Memory: 32 x 64G, 5200 MHz; OS: OpenEuler 22.03 LTS-SP4 | Atlas 800I A3, single node | 8 | W4A8C16 |

Service configuration

Low latency

    export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
    export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
    export HCCL_IF_IP=xxx
    export HCCL_OP_EXPANSION_MODE=AIV
    export HCCL_BUFFSIZE=1024
    export OMP_NUM_THREADS=1
    echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    sysctl -w vm.swappiness=0
    sysctl -w kernel.numa_balancing=0
    sysctl kernel.sched_migration_cost_ns=50000
    export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
    export TASK_QUEUE_ENABLE=1
    export VLLM_ASCEND_ENABLE_FUSED_MC2=1

    vllm serve /mnt/share/weights/Qwen3.5-397B-A17B-w8a8-org/ \
      --served-model-name qwen3.5 \
      --host 0.0.0.0 \
      --port 8010 \
      --data-parallel-size 1 \
      --tensor-parallel-size 16 \
      --enable-expert-parallel \
      --max-model-len 5000 \
      --max-num-batched-tokens 16384 \
      --max-num-seqs 128 \
      --gpu-memory-utilization 0.9 \
      --compilation-config '{"cudagraph_capture_sizes":[1,6,12,18,24,30,36,42,48,54,72,78,84,90,96,102,108,144,192], "cudagraph_mode":"FULL_DECODE_ONLY"}' \
      --speculative_config '{"method": "qwen3_5_mtp", "num_speculative_tokens": 3, "enforce_eager": true}' \
      --trust-remote-code \
      --async-scheduling \
      --allowed-local-media-path / \
      --quantization ascend \
      --mm-processor-cache-gb 0 \
      --additional-config '{"enable_cpu_binding":true}'

Typical test cases

| Avg. input | Avg. output | Parallel strategy | Context length | Prefix Cache hit rate | Total requests | Max concurrency | Request rate (req/s) |
|---|---|---|---|---|---|---|---|
| 2048 | 2048 | MLA DP1 TP16 | 5000 | 0 | 128 | 32 | 0 |
| 3500 | 1500 | MLA DP1 TP16 | 8000 | 0 | 80 | 20 | 0 |
| 16384 | 1024 | MLA DP1 TP16 | 18432 | 0 | 36 | 9 | 0 |
| 32768 | 512 | MLA DP1 TP16 | 34304 | 0 | 16 | 4 | 0 |
| 65536 | 1024 | MLA DP1 TP16 | 67584 | 0 | 8 | 2 | 0 |
| 131072 | 1024 | MLA DP1 TP16 | 133120 | 0 | 8 | 2 | 0 |

High throughput

    export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
    export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
    export HCCL_IF_IP=xxx
    export HCCL_OP_EXPANSION_MODE=AIV
    export HCCL_BUFFSIZE=1024
    export OMP_NUM_THREADS=1
    echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    sysctl -w vm.swappiness=0
    sysctl -w kernel.numa_balancing=0
    sysctl kernel.sched_migration_cost_ns=50000
    export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
    export TASK_QUEUE_ENABLE=1
    export VLLM_ASCEND_ENABLE_FUSED_MC2=1
    export VLLM_ASCEND_ENABLE_FLASHCOMM1=1

    vllm serve /mnt/share/weights/Qwen3.5-397B-A17B-w8a8-org/ \
      --served-model-name qwen3.5 \
      --host 0.0.0.0 \
      --port 8010 \
      --data-parallel-size 1 \
      --tensor-parallel-size 16 \
      --enable-expert-parallel \
      --max-model-len 5000 \
      --max-num-batched-tokens 16384 \
      --max-num-seqs 128 \
      --gpu-memory-utilization 0.9 \
      --compilation-config '{"cudagraph_capture_sizes":[1,4,8,12,16,24,32,48,56,64,72,84,96,108,112,128,160,172,196,200,212,232,256,272,288,312,328,344,360,384,400,416,432,448,480,512], "cudagraph_mode":"FULL_DECODE_ONLY"}' \
      --speculative_config '{"method": "qwen3_5_mtp", "num_speculative_tokens": 3, "enforce_eager": true}' \
      --trust-remote-code \
      --async-scheduling \
      --allowed-local-media-path / \
      --quantization ascend \
      --mm-processor-cache-gb 0 \
      --additional-config '{"enable_cpu_binding":true}'

The only difference from the low-latency setup is the extra VLLM_ASCEND_ENABLE_FLASHCOMM1=1 environment variable and the larger set of cudagraph_capture_sizes (up to 512).

Typical test cases

| Avg. input | Avg. output | Parallel strategy | Context length | Prefix Cache hit rate | Total requests | Max concurrency | Request rate (req/s) |
|---|---|---|---|---|---|---|---|
| 2048 | 2048 | MLA DP1 TP16 | 5000 | 0 | 2048 | 512 | 0 |
| 3500 | 1500 | MLA DP1 TP16 | 8000 | 0 | 512 | 128 | 0 |
| 16384 | 1024 | MLA DP1 TP16 | 18432 | 0 | 144 | 36 | 0 |
| 32768 | 512 | MLA DP1 TP16 | 34304 | 0 | 48 | 12 | 0 |
| 65536 | 1024 | MLA DP1 TP16 | 67584 | 0 | 32 | 8 | 0 |
| 131072 | 1024 | MLA DP1 TP16 | 133120 | 0 | 16 | 4 | 0 |

Test commands

Refer to the official aisbench testing guide. Related links: aisbench test commands; vllm-ascend community website.

Special note

None of the configurations above enable Prefix Cache. If a production environment needs this feature, enable it with --enable-prefix-caching as described in the vLLM-Ascend community parameter guide.
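
As a quick sanity check on the typical test cases listed above, the sketch below verifies that the average input plus average output of each case fits within its context length. It rests on an assumption not stated in the original: that the context-length column is the value passed to --max-model-len for that case (the serve commands show 5000, which matches only the first row). The helper name fits_context is hypothetical and used here only for illustration.

```shell
#!/bin/sh
# Sketch only: checks that avg input + avg output tokens fit in the context length.
# Assumption: "context length" in the tables is the --max-model-len used for that case.

# fits_context AVG_INPUT AVG_OUTPUT CONTEXT_LEN
# Exits 0 when input + output tokens fit in the context window, non-zero otherwise.
fits_context() {
    [ $(( $1 + $2 )) -le "$3" ]
}

# Rows copied from the typical test case tables: avg_input avg_output context_length
while read -r avg_in avg_out ctx; do
    if fits_context "$avg_in" "$avg_out" "$ctx"; then
        echo "ok: $avg_in + $avg_out fits in $ctx (headroom $(( ctx - avg_in - avg_out )) tokens)"
    else
        echo "too small: $avg_in + $avg_out exceeds $ctx"
    fi
done <<'EOF'
2048 2048 5000
3500 1500 8000
16384 1024 18432
32768 512 34304
65536 1024 67584
131072 1024 133120
EOF
```

The same check can be run before launching a new test case: if it fails, the context length for that case needs to be raised before the benchmark starts.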