CANN/cann-recipes-train:Qwen3-30B-A3B医学SFT训练示例
Qwen3-30B-A3B Medical SFT Training Example【免费下载链接】cann-recipes-train本项目针对LLM与多模态模型训练业务中的典型模型、加速算法提供基于CANN平台的优化样例项目地址: https://gitcode.com/cann/cann-recipes-trainThis example uses the torchtitan-npu framework to fine-tuneQwen3-30B-A3Bon a medical domain SFT task. Training effectiveness is measured via Keyword Recall on medical QA samples.The Medical R1 dataset (question/think/answer three-field format) is used for training. The MoE parallelism config (EP8) enables full-parameter fine-tuning on a single node with 16 cards. Evaluation uses vLLM vLLM-Ascend to compare the base model, CPT checkpoint, and SFT model under the same conditions.Supported ProductsItemSpecProductAtlas A3 seriesRecommended cards16 (EP8)CANN version9.0.0Python3.11Training frameworktorchtitan-npuInference frameworkvLLM vLLM-AscendFilesFileDescriptionREADME_EN.mdThis documentREADME.mdChinese documentationconfig_registry_medical.pytorchtitan-npu Qwen3-30B-A3B medical SFT configrun_medical_sft.shTraining launch script (copy to torchtitan-npu dir before running)prepare_medical_r1_dataset.pyMedical R1 dataset split toolfigures/training_loss.pngTraining loss curve (Epoch 1-5, optimal at step 156)Environment Setup1. Docker ContainerUse an Ascend training image with CANN 9.0.0 and Python 3.11 pre-installed. Example for single-node 16-card setup:docker run -itd \ --device/dev/davinci0 --device/dev/davinci1 \ --device/dev/davinci2 --device/dev/davinci3 \ --device/dev/davinci4 --device/dev/davinci5 \ --device/dev/davinci6 --device/dev/davinci7 \ --device/dev/davinci_manager --device/dev/devmm_svm \ --device/dev/hisi_hdc \ -v /usr/local/dcmi:/usr/local/dcmi \ -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \ -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \ -v /home:/home \ -v /data:/data \ --nethost \ --shm-size128g \ --privileged \ --name qwen3_30b_medical_sft \ cann:9.0.0-a3-openeuler24.03-py3.11 \ /bin/bashInitialize CANN after entering the container. CANN paths vary by deployment method — adjust according to your environment:# Docker image default path source /usr/local/Ascend/ascend-toolkit/set_env.sh # Conda installation path (example for CANN 9.0.0) source /home/developer/Ascend/cann-9.0.0/set_env.sh source /home/developer/Ascend/nnal/atb/set_env.sh # If the system libstdc is too old export LD_PRELOAD/path/to/conda/envs/torchtitan/lib/libstdc.so.62. Install torchtitan-npugit clone https://link.gitcode.com/i/a16fe6012169aa86df6ff4c2d4faa8cd.git cd torchtitan-npu pip install -r requirements.txt pip install -e .DatasetDownloadr1_data_example.jsonlfrom ModelScope and place it in theassetsdirectory of torchtitan-npu:cd /path/to/torchtitan-npu mkdir -p assets # Manually download from # https://modelscope.cn/datasets/krisfu/delicate_medical_r1_data/files # to assets/ ls assets/r1_data_example.jsonlThen split using the provided script:python /path/to/recipe/prepare_medical_r1_dataset.py \ --input ./assets/r1_data_example.jsonl \ --output ./assets/medical_r1Split result:DatasetSamplesUsetrain.jsonl~2,166SFT trainingtest.jsonl~241Keyword Recall evaluationModel WeightsDownloadQwen3-30B-A3Bweights (~60 GB) from ModelScope and create a symlink in the torchtitan-npu source directory:pip install modelscope mkdir -p /data/models/Qwen3-30B-A3B modelscope download \ --model Qwen/Qwen3-30B-A3B \ --local_dir /data/models/Qwen3-30B-A3B cd /path/to/torchtitan-npu mkdir -p assets/hf ln -sf /data/models/Qwen3-30B-A3B assets/hf/Qwen3-30B-A3BTraining ConfigurationConfig RegistrationCopyconfig_registry_medical.pyto the torchtitan-npu source:cp /path/to/recipe/config_registry_medical.py \ /path/to/torchtitan-npu/torchtitan_npu/models/qwen3/config_registry_medical.pyThen append the following totorchtitan_npu/models/qwen3/config_registry.py:from torchtitan_npu.models.qwen3.config_registry_medical import ( sft_qwen3_30ba3b_medical, sft_qwen3_30ba3b_medical_tnd, )Parallelism StrategySingle-node 16-card MoE parallelism (CP2, EP8, TP2):ParameterValueDescriptionNGPU16Total cardscontext_parallel_degree2Context parallelismtensor_parallel_degree2Tensor parallelismexpert_parallel_degree8128 experts sharded along EPpipeline_parallel_degree1PP disableddata_parallel_shard_degree-1FSDP full shard (mesh size 4)HyperparametersConfigRecommended ValueDescriptionsteps156Training steps (5 epochs, ~31 steps/epoch)lr2e-5Learning ratewarmup_steps5Warmup stepslocal_batch_size1Per-device batch sizeseq_len4096Sequence lengthactivation_checkpointselectiveSelective recomputationTRAIN_DATASplit training setTraining data path, set viaTRAIN_DATAenv varMODEL_DIRassets/hf/Qwen3-30B-A3BHF weights pathSample FormatThis example uses the R1 think template format, wrapping the datasetsthinkfield inthinktags:def _process_sample(sample): output fthink\n{sample[think]}\n/think\n\n{sample[answer]} return [ {role: user, content: sample[question]}, {role: assistant, content: output}, ]Attention VariantsConfig functionAttention typeDescriptionsft_qwen3_30ba3b_medicalBSND (SDPA)Reference onlysft_qwen3_30ba3b_medical_tndTND (NPUVarlenAttention)Recommended, validatedUseCONFIGsft_qwen3_30ba3b_medical_tndfor TND variant.TrainingLaunchCopy the launch script to the torchtitan-npu directory and execute:cp /path/to/recipe/run_medical_sft.sh /path/to/torchtitan-npu/ cd /path/to/torchtitan-npu bash run_medical_sft.shThe script uses environment variablesNGPU16andCONFIGsft_qwen3_30ba3b_medical_tndfor the TND variant. Example log output (EP8):step: 1 loss: 1.45426 memory: 37.73GiB(61.58%) tps: 59 69.018s (compilation) step: 2 loss: 1.39178 memory: 52.27GiB(85.31%) tps: 798 5.135s step: 3 loss: 1.26931 memory: 52.31GiB(85.37%) tps: 1215 3.370s step: 10 loss: 1.02183 memory: 52.44GiB(85.59%) tps: 993 4.126s step: 20 loss: 0.95751 memory: 52.44GiB(85.59%) tps: 1199 3.416s step: 31 loss: 0.70617 memory: 52.44GiB(85.59%) tps: 1345 3.046s ← epoch 1 end step: 32 loss: 0.67716 memory: 52.44GiB(85.59%) tps: 701 5.842s step: 50 loss: 0.58786 memory: 52.50GiB(85.69%) tps: 1010 4.056s step: 62 loss: 0.34057 memory: 52.56GiB(85.79%) tps: 1177 3.479s ← epoch 2 end step: 63 loss: 0.33076 memory: 52.56GiB(85.79%) tps: 803 5.102s step: 90 loss: 0.19230 memory: 52.56GiB(85.79%) tps: 733 5.590s step: 93 loss: 0.16940 memory: 52.56GiB(85.79%) tps: 1014 4.040s ← epoch 3 end step: 94 loss: 0.16507 memory: 52.56GiB(85.79%) tps: 1286 3.185s step: 120 loss: 0.08754 memory: 52.62GiB(85.88%) tps: 942 4.349s step: 124 loss: 0.08219 memory: 52.62GiB(85.88%) tps: 1257 3.260s ← epoch 4 end step: 125 loss: 0.08480 memory: 52.62GiB(85.88%) tps: 1274 3.215s step: 150 loss: 0.04411 memory: 52.62GiB(85.88%) tps: 918 4.462s step: 155 loss: 0.04376 memory: 52.62GiB(85.88%) tps: 1199 3.416s step: 156 loss: 0.04450 memory: 52.62GiB(85.88%) tps: 1244 3.292s ← end (epoch 5)Training Loss CurveBased on loss curve analysis,Epoch 5 (step 156) is the optimal stop: loss decline flattens after step 150, and training beyond 186 steps (epoch 6) enters the overfitting regime with no meaningful loss improvement.Model ExportWithlast_save_in_hfTrue, the final checkpoint is exported in HuggingFace format:mkdir -p /data/models/Qwen3-30B-A3B-SFT cp /data/models/Qwen3-30B-A3B/*.json /data/models/Qwen3-30B-A3B-SFT/ cp /data/models/Qwen3-30B-A3B/tokenizer* /data/models/Qwen3-30B-A3B-SFT/ cp checkpoint_medical/step-156/*.safetensors* /data/models/Qwen3-30B-A3B-SFT/Evaluation ResultsEvaluation MethodThis experiment uses a jieba-based keyword extraction method with POS tagging (n, v, a, i, j, l) to extract keywords from both reference answers and model outputs, then computes:Recall matched reference keywords / total reference keywordsPrecision matched reference keywords / total model keywordsF1 harmonic mean of Recall and PrecisionThink Rate proportion of outputs containingthinkreasoningEvaluation data: 241 medical QA samples. Base model, CPT intermediate checkpoint, and SFT model are compared under the same conditions.Keyword Recall ComparisonModelRecallPrecisionF1Base (Qwen3-30B-A3B)53.83%25.16%33.30%CPT Checkpoint (step 156)62.45%28.06%37.82%Improvement8.62pp2.90pp4.52ppOutput Format ComparisonMetricBaseCPTAvg output length1,061 chars831 chars(-21.7%)Format errors (repeated/think)199/2419/241Sample: What are the two components of consciousness?ItemBase ModelCPT CheckpointAnswer/think The components... arousal... content...(Markdown list 3x/think)Consciousness consists of two parts: the content and the switch system...(conversational)Recall52.4%95.2%Length392 chars287 charsTraining MetricsMetricValueStable step time~3.2-3.5sStable memory~52.6 GiB/card (85.9%)Loss start (step 1)1.45Loss end (step 156)0.045Total time (156 steps)~8-9 minutesNote: With CP2, TP2 the memory usage per card is ~52.6 GiB (85.9%).FAQ1. Loss starts abnormally highIf initial loss is significantly higher than expected (e.g., ~12), check whether HF pretrained weights were loaded correctly. Delete the checkpoint directory before re-running:rm -rf checkpoint_medical2. NPU out of memoryCheck for residual processes occupying NPU memory and ensurePYTORCH_NPU_ALLOC_CONFexpandable_segments:Trueis set. If necessary, addtorch.npu.set_per_process_memory_fraction(1.0)at the entry.py entry point.3. HCCL communication timeoutMulti-card training may trigger HCCL watchdog timeout. If intermittent, restarting training usually resolves it. If frequent, check HCCL network configuration and inter-node communication.【免费下载链接】cann-recipes-train本项目针对LLM与多模态模型训练业务中的典型模型、加速算法提供基于CANN平台的优化样例项目地址: https://gitcode.com/cann/cann-recipes-train创作声明:本文部分内容由AI辅助生成(AIGC),仅供参考