Single-Node Colocated Deployment of Qwen3.5-397B on Atlas 800I A2 with vLLM-Ascend
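Author: 昇腾实战派
Knowledge map: https://blog.csdn.net/Lumos_Lovegood/article/details/161601003

Background

This document describes a single-node colocated deployment of the Qwen3.5-397B model on Atlas 800I A2 with vLLM-Ascend. It covers the supported features, feature configuration, environment information, and typical performance test cases.

Basic information

| Software version | Device information | Topology | Total cards | Data format |
|---|---|---|---|---|
| 0.18.0 | NPU: Atlas 800I A2-280T, HBM 64G; CPU: Kunpeng 920, 48 cores, 2600 MHz; Memory: 24 × 32G, 3200 MHz; OS: Ubuntu 22.04 LTS | Single Atlas 800I A2 node | 8 | W4A8C16 |

Serving configuration

Low latency / high throughput

    export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
    export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
    export HCCL_IF_IP=xxx
    export HCCL_OP_EXPANSION_MODE=AIV
    export HCCL_BUFFSIZE=1024
    export OMP_NUM_THREADS=1
    export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
    export TASK_QUEUE_ENABLE=1
    export VLLM_ASCEND_ENABLE_FUSED_MC2=0
    export VLLM_ASCEND_ENABLE_FLASHCOMM1=1

    vllm serve /home/Qwen3.5-397B-A17B-w4a8-mtp \
      --served-model-name qwen35-397b-zz \
      --host xxx \
      --port 10888 \
      --data-parallel-size 1 \
      --tensor-parallel-size 8 \
      --max-model-len 133120 \
      --max-num-batched-tokens 16384 \
      --max-num-seqs 128 \
      --gpu-memory-utilization 0.9 \
      --enable-expert-parallel \
      --compilation-config '{"cudagraph_capture_sizes":[1,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76,80,84,88,92,96,100,108,112,128,160,172,196,200,212,232,256,260,288,320,360,400], "cudagraph_mode":"FULL_DECODE_ONLY"}' \
      --speculative_config '{"method": "qwen3_5_mtp", "num_speculative_tokens": 3, "enforce_eager": true}' \
      --trust-remote-code \
      --no-enable-prefix-caching \
      --async-scheduling \
      --allowed-local-media-path / \
      --quantization ascend \
      --mm-processor-cache-gb 0 \
      --additional-config '{"enable_cpu_binding":true, "multistream_overlap_shared_expert": true}'

Typical test cases

| Avg. input (tokens) | Avg. output (tokens) | Parallel strategy | Context length | Prefix Cache hit rate | Total requests | Max concurrency | Request rate (req/s) |
|---|---|---|---|---|---|---|---|
| 2048 | 2048 | MLADP1TP8 | 262144 | 0 | 288 | 72 | 0 |
| 2048 | 2048 | MLADP1TP8 | 262144 | 0 | 56 | 14 | 0 |
| 3500 | 1500 | MLADP1TP8 | 262144 | 0 | 256 | 64 | 0 |
| 3500 | 1500 | MLADP1TP8 | 262144 | 0 | 48 | 12 | 0 |
| 16384 | 1024 | MLADP1TP8 | 262144 | 0 | 80 | 20 | 0 |
| 16384 | 1024 | MLADP1TP8 | 262144 | 0 | 20 | 5 | 0 |
| 32768 | 512 | MLADP1TP8 | 262144 | 0 | 32 | 8 | 0 |
| 32768 | 512 | MLADP1TP8 | 262144 | 0 | 12 | 3 | 0 |
| 65536 | 1024 | MLADP1TP8 | 262144 | 0 | 28 | 7 | 0 |
| 65536 | 1024 | MLADP1TP8 | 262144 | 0 | 8 | 2 | 0 |
| 131072 | 1024 | MLADP1TP8 | 262144 | 0 | 16 | 4 | 0 |
| 131072 | 1024 | MLADP1TP8 | 262144 | 0 | 4 | 1 | 0 |

For the test commands, refer to the official aisbench testing guide.

Related links: aisbench test commands; vllm-ascend community website

Special note

None of the configurations above enable Prefix Cache. If your production environment needs this feature, refer to the vLLM-Ascend community parameter guide and enable it with --enable-prefix-caching.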
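
The prefix-caching note above can be sketched as a small shell helper that picks the flag from an environment variable, so the same launch script can serve both the benchmark setup (Prefix Cache off) and a production setup (Prefix Cache on). This is only an illustration: the variable name `ENABLE_PREFIX_CACHE` and the function `prefix_cache_flag` are hypothetical and not part of vLLM or vLLM-Ascend. Only the two flags themselves, `--enable-prefix-caching` and `--no-enable-prefix-caching`, come from this article.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of vLLM or vLLM-Ascend): print the
# prefix-caching flag to pass to `vllm serve`.
# ENABLE_PREFIX_CACHE is an illustrative variable name chosen for this sketch:
# "1" turns the feature on, anything else (or unset) keeps it off, which
# matches the benchmark configuration in this article.
prefix_cache_flag() {
  if [ "${ENABLE_PREFIX_CACHE:-0}" = "1" ]; then
    printf '%s\n' "--enable-prefix-caching"
  else
    printf '%s\n' "--no-enable-prefix-caching"
  fi
}

# Dry run: show which flag would be added to the serving command.
echo "prefix caching flag: $(prefix_cache_flag)"
```

To use it, replace the literal `--no-enable-prefix-caching` in the serving command with `$(prefix_cache_flag)` and set `ENABLE_PREFIX_CACHE=1` before launching when the feature is wanted. The performance numbers for the test cases in this article were collected with Prefix Cache off, so they may not carry over once it is enabled.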