Qwen2.5-Coder-32B-Instruct-AWQ模型部署
1.系统环境NVIDIA T4 * 2 /16G * 2 Driver Version: 535.154.05 CUDA Version: 12.2Qwen/Qwen2.5-Coder-32B-Instruct-AWQ2.vllm镜像下载使用vllm加载模型dockerpull vllm/vllm-openai:latest3.模型下载阿里魔搭社区https://www.modelscope.cn/models使用vllm容器下载dockerrun--rm-it\--gpusall\--entrypoint/bin/bash\--pids-limit-1\--security-optseccompunconfined\-v/root/lipengcheng/qwen2532ia:/models\-eOMP_NUM_THREADS8\vllm/vllm-openai:latest\-cpip install modelscope python3 -c\from modelscope import snapshot_download; snapshot_download(Qwen/Qwen2.5-Coder-32B-Instruct-AWQ, cache_dir/models)\4.加载Qwen2.5-Coder-32B-Instruct-AWQ模型dockerrun--gpusall-d-p8000:8000--nameqwen2.5-coder32\--ipchost\--pids-limit-1\--security-optseccompunconfined\-v/root/lipengcheng/qwen2532ia/Qwen/Qwen2___5-Coder-32B-Instruct-AWQ:/model\-eHF_DATASETS_OFFLINE1\-eTRANSFORMERS_OFFLINE1\-eOMP_NUM_THREADS16\vllm/vllm-openai:latest\--model/model\--tensor-parallel-size2\--max-model-len16384\--gpu-memory-utilization0.9\--trust-remote-code看到如下日志就说明加载成功了5.模型测试测试命令curlhttp://localhost:8000/v1/chat/completions\-HContent-Type: application/json\-d{ model: /model, messages: [{role: user, content: 你好}] }返回内容{id:chatcmpl-bf4f4555eeceea94,object:chat.completion,created:1778649567,model:/model,choices:[{index:0,message:{role:assistant,content:你好有什么我可以帮忙的吗,refusal:null,annotations:null,audio:null,function_call:null,tool_calls:[],reasoning:null},logprobs:null,finish_reason:stop,stop_reason:null,token_ids:null}],service_tier:null,system_fingerprint:null,usage:{prompt_tokens:30,total_tokens:39,completion_tokens:9,prompt_tokens_details:null},prompt_logprobs:null,prompt_token_ids:null,kv_transfer_params:null}