Google Research Football + CleanRL: Single-Agent Training Environment Setup Notes
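
1. Environment scope

This document records the setup procedure for Google Research Football (GRF) together with CleanRL. The environment is used for single-agent reinforcement learning and mainly targets the GRF Academy scenarios, for example:

- academy_empty_goal_close
- academy_empty_goal
- academy_run_to_score
- academy_run_to_score_with_keeper
- academy_pass_and_shoot_with_keeper

The setup goals are:

- GRF serves as the football simulation environment
- CleanRL provides the single-file PPO training script
- PyTorch trains the policy network on CUDA
- TensorBoard records the training curves
- Model saving, model loading, evaluation, and video and dump output are supported

This environment is not used as a multi-agent training framework. Later multi-agent training should get its own environment, connected to EPyMARL, the official MAPPO implementation, or another MARL framework.

2. Verified versions

This document uses the following version combination:

    System: Ubuntu Linux
    Python: 3.8
    gfootball==2.10.2
    gym==0.23.1
    gymnasium==0.29.1
    torch==1.11.0+cu113
    torchvision==0.12.0+cu113
    torchaudio==0.11.0
    tensorboard==2.14.0
    protobuf==3.20.3
    numpy==1.24.4
    scipy==1.10.1
    pygame==2.1.2
    opencv-python==4.8.1.78
    tyro==0.7.3

Example GPU driver on this machine:

    NVIDIA-SMI 535.230.02    Driver Version: 535.230.02    CUDA Version: 12.2

Note: the CUDA Version shown by nvidia-smi is the highest CUDA API version the driver supports. This document installs the CUDA 11.3 runtime that ships inside the PyTorch wheels, so a separate system-wide CUDA 11.3 Toolkit is not required.

3. Install system dependencies

On Linux, Google Research Football has to compile and run a C++ game engine, so SDL2, Boost, OpenGL, Xvfb, and related dependencies must be installed:

    sudo apt-get update
    sudo apt-get install -y \
      git cmake build-essential pkg-config \
      libgl1-mesa-dev mesa-utils \
      libsdl2-dev libsdl2-image-dev libsdl2-ttf-dev libsdl2-gfx-dev \
      libboost-all-dev \
      libdirectfb-dev libst-dev \
      xvfb x11vnc \
      python3-dev python3-pip

4. Create the Conda environment

Create and activate the environment:

    conda create -n grf_cleanrl python=3.8 -y
    conda activate grf_cleanrl

Confirm that Python and pip come from the current environment:

    which python
    which pip
    python --version

The expected output looks like:

    /home/hx/anaconda3/envs/grf_cleanrl/bin/python
    /home/hx/anaconda3/envs/grf_cleanrl/bin/pip
    Python 3.8.x

5. Install the GRF base dependencies

First pin the pip, setuptools, and wheel versions:

    python -m pip install --upgrade pip==23.3.2 setuptools==65.5.0 wheel==0.41.3

Install the Python packages GRF needs at runtime:

    python -m pip install \
      numpy==1.24.4 \
      scipy==1.10.1 \
      pygame==2.1.2 \
      opencv-python==4.8.1.78 \
      absl-py==1.4.0 \
      psutil==5.9.8 \
      cloudpickle==2.2.1 \
      gym==0.23.1 \
      six==1.16.0

Install Google Research Football:

    python -m pip install gfootball==2.10.2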
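
After the installs in this section, the pins can be checked in one go. The following is a minimal sketch, not part of the original procedure: the names `GRF_PINS` and `find_mismatches` are hypothetical, and it only relies on the standard-library `importlib.metadata` module (available from Python 3.8).

```python
from importlib import metadata

# Pins copied from the install commands in this section.
GRF_PINS = {
    "numpy": "1.24.4",
    "scipy": "1.10.1",
    "pygame": "2.1.2",
    "opencv-python": "4.8.1.78",
    "absl-py": "1.4.0",
    "psutil": "5.9.8",
    "cloudpickle": "2.2.1",
    "gym": "0.23.1",
    "six": "1.16.0",
    "gfootball": "2.10.2",
}


def find_mismatches(pins, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for every unmet pin.

    `get_version` is injectable so the function can be exercised without
    touching the real environment.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in find_mismatches(GRF_PINS).items():
        print(f"{name}: expected {expected}, found {installed}")
```

Run inside the activated grf_cleanrl environment, the script prints nothing when every pin matches; any printed line names a package to reinstall.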

6. Test the GRF installation

Run a test without rendering:

    python - <<'PY'
    from gfootball.env import create_environment

    env = create_environment(
        env_name="academy_empty_goal_close",
        representation="simple115v2",
        render=False,
    )

    obs = env.reset()
    print("reset ok")
    print("obs type:", type(obs))
    print("obs shape:", getattr(obs, "shape", None))

    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    print("step ok")
    print("reward:", reward)
    print("done:", done)
    print("info type:", type(info))

    env.close()
    PY

If everything works, the output includes:

    reset ok
    step ok

Test the visualization window:

    python -m gfootball.play_game --action_set=full

On a remote server or a machine without a display, use a virtual display:

    xvfb-run -s "-screen 0 1280x720x24" python -m gfootball.play_game --action_set=full

Training does not need a window by default.

7. Install the CUDA build of PyTorch

This document uses a PyTorch version combination that has already been verified:

    torch==1.11.0+cu113
    torchvision==0.12.0+cu113
    torchaudio==0.11.0

Install command:

    python -m pip uninstall -y torch torchvision torchaudio
    python -m pip install \
      torch==1.11.0+cu113 \
      torchvision==0.12.0+cu113 \
      torchaudio==0.11.0 \
      --extra-index-url https://download.pytorch.org/whl/cu113

Verify CUDA:

    python - <<'PY'
    import torch
    import torchvision
    import torchaudio

    print("torch:", torch.__version__)
    print("torchvision:", torchvision.__version__)
    print("torchaudio:", torchaudio.__version__)
    print("torch cuda runtime:", torch.version.cuda)
    print("cuda available:", torch.cuda.is_available())

    if not torch.cuda.is_available():
        raise RuntimeError("CUDA is not available")

    print("gpu:", torch.cuda.get_device_name(0))

    # Run a small matrix multiplication on the GPU to confirm kernels execute.
    x = torch.randn(2048, 2048, device="cuda")
    y = x @ x
    torch.cuda.synchronize()
    print("cuda tensor ok:", float(y.mean()))
    PY

The output should include lines starting with:

    torch: 1.11.0+cu113
    torch cuda runtime: 11.3
    cuda available: True
    cuda tensor ok
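
The relationship between the CUDA version reported by nvidia-smi and the CUDA runtime bundled with PyTorch can be checked mechanically. The sketch below is an illustration only: `parse_version` and `driver_supports_runtime` are hypothetical helper names, and the rule it encodes (the driver's reported CUDA version must be at least the runtime version) is the usual backward-compatibility expectation, not an official compatibility matrix.

```python
import re


def parse_version(text):
    """Turn '11.3' into (11, 3) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.split("."))


def driver_supports_runtime(nvidia_smi_header, runtime_version):
    """Return True if the driver's maximum CUDA version covers the runtime version.

    `nvidia_smi_header` is the header line printed by nvidia-smi;
    `runtime_version` is what torch.version.cuda reports, e.g. '11.3'.
    """
    match = re.search(r"CUDA Version:\s*([0-9]+(?:\.[0-9]+)*)", nvidia_smi_header)
    if match is None:
        raise ValueError("no 'CUDA Version' field found in nvidia-smi output")
    return parse_version(match.group(1)) >= parse_version(runtime_version)


header = "NVIDIA-SMI 535.230.02 Driver Version: 535.230.02 CUDA Version: 12.2"
print(driver_supports_runtime(header, "11.3"))
```

With the driver header from section 2 and the 11.3 runtime used here, the check passes, which is why no separate CUDA Toolkit install is needed.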

8. Install the CleanRL dependencies

Install the packages CleanRL needs:

    python -m pip install \
      gymnasium==0.29.1 \
      tensorboard==2.14.0 \
      tyro==0.7.3 \
      moviepy==1.0.3

TensorBoard 2.14.0 can conflict with newer protobuf releases, so pin protobuf:

    python -m pip install --force-reinstall protobuf==3.20.3

Verify:

    python - <<'PY'
    import torch
    import gym
    import gymnasium
    import tensorboard
    import google.protobuf
    import gfootball

    print("torch:", torch.__version__)
    print("cuda:", torch.cuda.is_available())
    print("gym:", gym.__version__)
    print("gymnasium:", gymnasium.__version__)
    print("tensorboard:", tensorboard.__version__)
    print("protobuf:", google.protobuf.__version__)
    print("gfootball ok")
    PY

9. Download the CleanRL source

Clone the repository:

    mkdir -p ~/rl_projects
    cd ~/rl_projects
    git clone https://github.com/vwxyzjn/cleanrl.git
    cd cleanrl

First test the original CleanRL PPO script:

    python cleanrl/ppo.py \
      --env-id CartPole-v1 \
      --total-timesteps 5000 \
      --num-envs 4 \
      --num-steps 128 \
      --learning-rate 2.5e-4

If CartPole trains normally, the basic CleanRL training stack is working.

10. Copy a GRF-specific PPO script

Do not modify the original ppo.py directly. Make a copy:

    cd ~/rl_projects/cleanrl
    cp cleanrl/ppo.py cleanrl/ppo_grf.py

From here on, only modify:

    cleanrl/ppo_grf.py
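
It helps to know how the `--total-timesteps`, `--num-envs`, and `--num-steps` flags used in the CartPole command translate into PPO updates. The sketch below is arithmetic only and assumes the CleanRL PPO convention of one update per `num_envs * num_steps` collected steps, with a default of 4 minibatches; `rollout_plan` is a hypothetical helper name, and the copy of ppo.py in use should be checked if its defaults differ.

```python
def rollout_plan(total_timesteps, num_envs, num_steps, num_minibatches=4):
    """Summarize how many PPO updates a run performs and how many steps it collects."""
    batch_size = num_envs * num_steps              # steps gathered before each update
    num_updates = total_timesteps // batch_size    # a partial final batch is dropped
    return {
        "batch_size": batch_size,
        "minibatch_size": batch_size // num_minibatches,
        "num_updates": num_updates,
        "steps_collected": num_updates * batch_size,
    }


# The CartPole smoke test above: 4 envs x 128 steps, 5000 requested timesteps.
print(rollout_plan(5000, 4, 128))
```

Under these assumptions the 5000-step smoke test performs 9 updates over 4608 environment steps, so a final global_step slightly below the requested total is expected.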

11. Create the GRF-to-Gymnasium wrapper

GRF uses the old Gym-style interface:

    obs = env.reset()
    obs, reward, done, info = env.step(action)

The current CleanRL PPO script is closer to the Gymnasium interface:

    obs, info = env.reset()
    obs, reward, terminated, truncated, info = env.step(action)

A wrapper is therefore needed. Create the file:

    cat > cleanrl/grf_env_wrapper.py <<'PY'
    import gymnasium as gym
    import numpy as np
    from gymnasium import spaces

    from gfootball.env import create_environment


    class GRFGymnasiumWrapper(gym.Env):
        """Expose a GRF environment through the Gymnasium reset/step API."""

        metadata = {"render_modes": []}

        def __init__(
            self,
            env_name="academy_empty_goal_close",
            rewards="scoring,checkpoints",
            render=False,
            write_video=False,
            write_full_episode_dumps=False,
            write_goal_dumps=False,
            dump_frequency=1,
            logdir="/tmp/football",
        ):
            super().__init__()
            self.env = create_environment(
                env_name=env_name,
                representation="simple115v2",
                rewards=rewards,
                render=render,
                write_video=write_video,
                write_full_episode_dumps=write_full_episode_dumps,
                write_goal_dumps=write_goal_dumps,
                dump_frequency=dump_frequency,
                logdir=logdir,
            )

            # Reset once to discover the observation shape.
            obs = self.env.reset()
            obs = np.asarray(obs, dtype=np.float32)

            self.observation_space = spaces.Box(
                low=-np.inf,
                high=np.inf,
                shape=obs.shape,
                dtype=np.float32,
            )
            self.action_space = spaces.Discrete(int(self.env.action_space.n))

        def reset(self, seed=None, options=None):
            super().reset(seed=seed)
            obs = self.env.reset()
            obs = np.asarray(obs, dtype=np.float32)
            return obs, {}

        def step(self, action):
            obs, reward, done, info = self.env.step(int(action))
            obs = np.asarray(obs, dtype=np.float32)

            if info is None:
                info = {}
            else:
                info = dict(info)

            # Avoid a key clash with RecordEpisodeStatistics, which writes info["episode"].
            if "episode" in info:
                info["grf_episode"] = info.pop("episode")

            # GRF reports a single done flag; it is passed on as terminated.
            terminated = bool(done)
            truncated = False
            return obs, float(reward), terminated, truncated, info

        def close(self):
            self.env.close()


    def make_grf_env(env_id):
        env = GRFGymnasiumWrapper(
            env_name=env_id,
            rewards="scoring,checkpoints",
            render=False,
        )
        return env
    PY

Test the wrapper:

    python - <<'PY'
    from cleanrl.grf_env_wrapper import make_grf_env

    env = make_grf_env("academy_empty_goal_close")
    obs, info = env.reset()
    print("reset ok")
    print("obs shape:", obs.shape)
    print("action space:", env.action_space)

    for i in range(5):
        obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
        print(i, reward, terminated, truncated)

    env.close()
    print("wrapper ok")
    PY

12. Modify ppo_grf.py to use GRF

Make four changes to ppo_grf.py:

    1. Import make_grf_env
    2. Replace gym.make(env_id) with make_grf_env(env_id)
    3. Enforce a CUDA check
    4. Add model-saving logic

Apply them with an automatic patch:

    python - <<'PY'
    from pathlib import Path

    p = Path("cleanrl/ppo_grf.py")
    s = p.read_text()

    # 1. Import the wrapper factory right after the SummaryWriter import.
    if "from grf_env_wrapper import make_grf_env" not in s:
        marker = "from torch.utils.tensorboard import SummaryWriter\n"
        if marker not in s:
            raise RuntimeError("Cannot find SummaryWriter import.")
        s = s.replace(
            marker,
            marker + "from grf_env_wrapper import make_grf_env\n"
        )

    # 2. Route environment creation through the GRF wrapper.
    s = s.replace(
        'gym.make(env_id, render_mode="rgb_array")',
        'make_grf_env(env_id)'
    )
    s = s.replace(
        'gym.make(env_id)',
        'make_grf_env(env_id)'
    )

    # 3. Print the device and abort if CUDA is not in use.
    old = 'device = torch.device("cuda" if torch.cuda.is_available() and args.cuda else "cpu")'
    new = '''device = torch.device("cuda" if torch.cuda.is_available() and args.cuda else "cpu")
        print(f"device: {device}, torch: {torch.__version__}, cuda_runtime: {torch.version.cuda}")
        if device.type != "cuda":
            raise RuntimeError("CUDA is required for this GRF PPO run, but device is not cuda.")'''

    if old in s and "CUDA is required for this GRF PPO run" not in s:
        s = s.replace(old, new)

    # 4a. Add a save_model argument next to capture_video.
    if "save_model:" not in s:
        anchors = [
            "    capture_video: bool = False\n",
            "    capture_video: bool = True\n",
        ]
        inserted = False
        for anchor in anchors:
            if anchor in s:
                s = s.replace(
                    anchor,
                    anchor + "    save_model: bool = True\n",
                    1,
                )
                inserted = True
                break
        if not inserted:
            print("[WARN] save_model field was not inserted automatically.")

    # 4b. Save the model just before the environments and writer are closed.
    if "model saved to:" not in s:
        anchor = "    envs.close()\n    writer.close()\n"
        if anchor in s:
            save_block = '''
        if args.save_model:
            model_path = f"runs/{run_name}/{args.exp_name}.cleanrl_model"
            os.makedirs(os.path.dirname(model_path), exist_ok=True)
            torch.save(agent.state_dict(), model_path)
            print(f"model saved to: {model_path}")

    '''
            s = s.replace(anchor, save_block + anchor, 1)
        else:
            print("[WARN] model save block was not inserted automatically.")

    p.write_text(s)
    print("patched:", p)
    PY

Check whether any incorrect gym.make call remains:

    grep -n "make_grf_env\|gym.make\|device:\|save_model\|torch.save" cleanrl/ppo_grf.py | head -80

The ideal state is:

- make_grf_env is present
- env = gym.make(env_id) is absent
- the device: print statement is present
- torch.save is present

13. Run a 20,000-step connectivity test

Start a short training run:

    CUDA_VISIBLE_DEVICES=0 python cleanrl/ppo_grf.py \
      --env-id academy_empty_goal_close \
      --total-timesteps 20000 \
      --num-envs 8 \
      --num-steps 128 \
      --learning-rate 2.5e-4 \
      --seed 1

After startup you should see:

    device: cuda, torch: 1.11.0+cu113, cuda_runtime: 11.3

If it shows device: cpu, the CUDA build of PyTorch is not installed correctly. Stop training and recheck the torch installation.

Watch the GPU:

    watch -n 1 nvidia-smi

GRF simulation steps run mainly on the CPU, so GPU utilization is not necessarily high. As long as the training script prints device: cuda and the CUDA tensor test passes, the PPO network is being trained on the GPU.

14. Full training on academy_empty_goal_close

Start the run in the background:

    cd ~/rl_projects/cleanrl
    conda activate grf_cleanrl
    mkdir -p logs

    CUDA_VISIBLE_DEVICES=0 nohup python -u cleanrl/ppo_grf.py \
      --env-id academy_empty_goal_close \
      --total-timesteps 1000000 \
      --num-envs 8 \
      --num-steps 128 \
      --learning-rate 2.5e-4 \
      --seed 1 \
      > logs/grf_empty_goal_close_seed1_1M.log 2>&1 &

Follow the log:

    tail -f logs/grf_empty_goal_close_seed1_1M.log

When training is healthy, the log shows lines like:

    global_step=999376, episodic_return=[2.]
    global_step=999408, episodic_return=[2.]
    SPS: 382

15. Start TensorBoard

Launch TensorBoard:

    cd ~/rl_projects/cleanrl
    conda activate grf_cleanrl
    tensorboard --logdir runs --port 6006

Open in a browser:

    http://localhost:6006

The main curves to watch are:

    charts/episodic_return
    charts/episodic_length
    charts/SPS
    losses/policy_loss
    losses/value_loss
    losses/entropy
    losses/approx_kl

If TensorBoard reports the following error:

    TypeError: MessageToJson() got an unexpected keyword argument 'including_default_value_fields'

the protobuf version is too new. Run:

    python -m pip install --force-reinstall protobuf==3.20.3

Then restart TensorBoard.

16. Find the saved model

After training finishes, run:

    find runs -type f -name "*.cleanrl_model" | sort | tail -10

Store the path of the latest model:

    MODEL_PATH=$(find runs -type f -name "*.cleanrl_model" | sort | tail -1)
    echo "$MODEL_PATH"

Note that sort orders by path, so this picks the newest run only among runs that share the same env-id, script name, and seed.

Example path:

    runs/academy_empty_goal_close__ppo_grf__1__1782704313/ppo_grf.cleanrl_model
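
Because `sort` orders model paths as text, the last line is not necessarily the most recent run once several scenarios have been trained. The sketch below picks the newest model by the trailing timestamp instead. It assumes the run directory name ends in `__<unix timestamp>`, as in the example path above; `run_timestamp` and `newest_model` are hypothetical helper names.

```python
from pathlib import Path, PurePath


def run_timestamp(model_path):
    """Extract the trailing Unix timestamp from the run directory name."""
    run_name = PurePath(model_path).parent.name    # e.g. env__exp__seed__1782704313
    return int(run_name.rsplit("__", 1)[-1])


def newest_model(model_paths):
    """Return the model whose run directory carries the largest timestamp."""
    return max(model_paths, key=run_timestamp)


if __name__ == "__main__":
    # Collect every saved model under runs/ (empty if nothing was trained yet).
    found = [str(p) for p in Path("runs").rglob("*.cleanrl_model")]
    if found:
        print(newest_model(found))
```

The printed path can be assigned to MODEL_PATH in place of the `find | sort | tail -1` result.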

17. Create the evaluation script

Create cleanrl/eval_grf.py:

    cat > cleanrl/eval_grf.py <<'PY'
    import argparse
    import time

    import numpy as np
    import torch
    import gymnasium as gym

    from ppo_grf import Agent
    from grf_env_wrapper import GRFGymnasiumWrapper


    def parse_args():
        parser = argparse.ArgumentParser()
        parser.add_argument("--model-path", type=str, required=True)
        parser.add_argument("--env-id", type=str, default="academy_empty_goal_close")
        parser.add_argument("--episodes", type=int, default=10)
        parser.add_argument("--seed", type=int, default=1)
        parser.add_argument("--deterministic", action="store_true")
        parser.add_argument("--cuda", action="store_true")
        parser.add_argument("--render", action="store_true")
        parser.add_argument("--sleep", type=float, default=0.0)
        parser.add_argument("--write-video", action="store_true")
        parser.add_argument("--write-dumps", action="store_true")
        parser.add_argument("--logdir", type=str, default="grf_eval_dumps")
        return parser.parse_args()


    def make_eval_env(args):
        env = GRFGymnasiumWrapper(
            env_name=args.env_id,
            rewards="scoring,checkpoints",
            render=args.render,
            write_video=args.write_video,
            write_full_episode_dumps=args.write_dumps,
            write_goal_dumps=False,
            dump_frequency=1,
            logdir=args.logdir,
        )
        env = gym.wrappers.RecordEpisodeStatistics(env)
        return env


    def main():
        args = parse_args()
        device = torch.device("cuda" if args.cuda and torch.cuda.is_available() else "cpu")
        print("device:", device)
        print("torch:", torch.__version__)
        print("cuda runtime:", torch.version.cuda)
        print("model:", args.model_path)

        # The Agent constructor expects a vector env, so wrap the single env.
        envs = gym.vector.SyncVectorEnv([lambda: make_eval_env(args)])
        agent = Agent(envs).to(device)
        state_dict = torch.load(args.model_path, map_location=device)
        agent.load_state_dict(state_dict)
        agent.eval()

        returns = []
        lengths = []

        obs, info = envs.reset(seed=args.seed)
        obs = torch.tensor(obs, dtype=torch.float32, device=device)

        ep_return = 0.0
        ep_length = 0
        finished = 0

        while finished < args.episodes:
            with torch.no_grad():
                if args.deterministic:
                    # Greedy action: argmax over the actor logits.
                    if hasattr(agent, "network"):
                        hidden = agent.network(obs)
                        logits = agent.actor(hidden)
                    else:
                        logits = agent.actor(obs)
                    action = torch.argmax(logits, dim=1)
                else:
                    # Stochastic action sampled from the policy.
                    action, _, _, _ = agent.get_action_and_value(obs)

            next_obs, reward, terminated, truncated, infos = envs.step(action.cpu().numpy())
            ep_return += float(reward[0])
            ep_length += 1

            if args.sleep > 0:
                time.sleep(args.sleep)

            done = bool(terminated[0] or truncated[0])
            obs = torch.tensor(next_obs, dtype=torch.float32, device=device)

            if done:
                finished += 1
                returns.append(ep_return)
                lengths.append(ep_length)
                print(
                    f"episode={finished}, "
                    f"return={ep_return:.3f}, "
                    f"length={ep_length}"
                )
                ep_return = 0.0
                ep_length = 0
                obs, info = envs.reset(seed=args.seed + finished)
                obs = torch.tensor(obs, dtype=torch.float32, device=device)

        envs.close()

        print("mean_return:", float(np.mean(returns)))
        print("std_return:", float(np.std(returns)))
        print("mean_length:", float(np.mean(lengths)))
        print("returns:", returns)


    if __name__ == "__main__":
        main()
    PY

18. Evaluation without rendering

Run 20 deterministic episodes on the latest model:

    MODEL_PATH=$(find runs -type f -name "*.cleanrl_model" | sort | tail -1)
    echo "$MODEL_PATH"

    CUDA_VISIBLE_DEVICES=0 python cleanrl/eval_grf.py \
      --model-path "$MODEL_PATH" \
      --env-id academy_empty_goal_close \
      --episodes 20 \
      --deterministic \
      --cuda

A sufficiently trained academy_empty_goal_close model should frequently reach episode returns close to 2.0.

19. Evaluation with rendering

On a Linux desktop or in a NoMachine session, run:

    CUDA_VISIBLE_DEVICES=0 python cleanrl/eval_grf.py \
      --model-path "$MODEL_PATH" \
      --env-id academy_empty_goal_close \
      --episodes 5 \
      --deterministic \
      --cuda \
      --render \
      --sleep 0.02

If the window fails to open, use a virtual display:

    CUDA_VISIBLE_DEVICES=0 xvfb-run -s "-screen 0 1280x720x24" python cleanrl/eval_grf.py \
      --model-path "$MODEL_PATH" \
      --env-id academy_empty_goal_close \
      --episodes 5 \
      --deterministic \
      --cuda \
      --render \
      --sleep 0.02
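
The difference between evaluating with and without `--deterministic` comes down to how an action is chosen from the actor's logits. The following torch-free sketch illustrates the two modes on a plain list of logits; `softmax` and `select_action` are hypothetical names, and the sampling branch only approximates what a categorical policy does inside the agent.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def select_action(logits, deterministic, rng=random):
    """Pick an action index: greedy argmax, or a sample from the softmax distribution."""
    if deterministic:
        return max(range(len(logits)), key=logits.__getitem__)
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [0.1, 2.0, -1.0]
print("greedy:", select_action(logits, deterministic=True))
print("sampled:", select_action(logits, deterministic=False, rng=random.Random(0)))
```

Greedy selection always returns the highest-scoring action, which usually gives more stable evaluation numbers, while sampling reproduces the exploration behaviour seen during training.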

20. Save videos and dumps

Run the evaluation with video and dump output enabled:

    mkdir -p grf_eval_dumps

    CUDA_VISIBLE_DEVICES=0 python cleanrl/eval_grf.py \
      --model-path "$MODEL_PATH" \
      --env-id academy_empty_goal_close \
      --episodes 5 \
      --deterministic \
      --cuda \
      --write-video \
      --write-dumps \
      --logdir grf_eval_dumps

List the output files:

    find grf_eval_dumps -type f | sort | head -50

If the plain run produces no files, add xvfb-run:

    mkdir -p grf_eval_dumps

    CUDA_VISIBLE_DEVICES=0 xvfb-run -s "-screen 0 1280x720x24" python cleanrl/eval_grf.py \
      --model-path "$MODEL_PATH" \
      --env-id academy_empty_goal_close \
      --episodes 5 \
      --deterministic \
      --cuda \
      --write-video \
      --write-dumps \
      --logdir grf_eval_dumps

21. Freeze the current stable version

Once the environment works end to end, save the scripts and dependency lists:

    cd ~/rl_projects/cleanrl
    conda activate grf_cleanrl

    mkdir -p ~/rl_projects/grf_cleanrl_single_agent_v1

    cp cleanrl/ppo_grf.py ~/rl_projects/grf_cleanrl_single_agent_v1/
    cp cleanrl/grf_env_wrapper.py ~/rl_projects/grf_cleanrl_single_agent_v1/
    cp cleanrl/eval_grf.py ~/rl_projects/grf_cleanrl_single_agent_v1/

    python -m pip freeze > ~/rl_projects/grf_cleanrl_single_agent_v1/pip_freeze.txt
    conda env export > ~/rl_projects/grf_cleanrl_single_agent_v1/conda_env_export.yml

    find runs -type f -name "*.cleanrl_model" | sort > ~/rl_projects/grf_cleanrl_single_agent_v1/model_list.txt
    find grf_eval_dumps -type f | sort > ~/rl_projects/grf_cleanrl_single_agent_v1/eval_dump_list.txt

Generate a README file:

    cat > ~/rl_projects/grf_cleanrl_single_agent_v1/README.md <<'MD'
    # GRF CleanRL Single-Agent v1

    This environment is used for single-agent Google Research Football training with CleanRL PPO.

    Scope:
    - Google Research Football academy single-agent tasks
    - PPO training with CleanRL-style script
    - CUDA training
    - TensorBoard logging
    - model saving
    - model evaluation
    - video and dump output

    Not intended for:
    - multi-agent GRF training
    - centralized training with decentralized execution
    - QMIX, VDN, MAPPO multi-agent experiments
    - full 11v11 multi-agent training

    Main environment:
    - Python 3.8
    - gfootball 2.10.2
    - gym 0.23.1
    - gymnasium 0.29.1
    - torch 1.11.0+cu113
    - torchvision 0.12.0+cu113
    - torchaudio 0.11.0
    - tensorboard 2.14.0
    - protobuf 3.20.3

    Main scripts:
    - ppo_grf.py
    - grf_env_wrapper.py
    - eval_grf.py
    MD

22. Common problems

22.1 Missing six

Symptom:

    ModuleNotFoundError: No module named 'six'

Fix:

    python -m pip install six==1.16.0

22.2 Gymnasium cannot find academy_empty_goal_close

Symptom:

    gymnasium.error.NameNotFound: Environment academy_empty_goal_close doesn't exist.

The cause is that GRF scenarios are not registered Gymnasium environments, so this call cannot be used:

    gym.make("academy_empty_goal_close")

Create the environment through:

    from gfootball.env import create_environment

and adapt it with GRFGymnasiumWrapper.

22.3 RecordEpisodeStatistics conflict

Symptom:

    ValueError: Attempted to add episode stats when they already exist

The cause is either that RecordEpisodeStatistics wraps the environment twice, or that the GRF info dict already contains an episode key. The fix is to rename the key in the wrapper:

    if "episode" in info:
        info["grf_episode"] = info.pop("episode")

Also make sure grf_env_wrapper.py does not add an extra RecordEpisodeStatistics layer of its own.

22.4 TensorBoard hparams plugin error

Symptom:

    TypeError: MessageToJson() got an unexpected keyword argument 'including_default_value_fields'

Fix:

    python -m pip install --force-reinstall protobuf==3.20.3

22.5 eval_grf.py reports that Agent has no network

Symptom:

    AttributeError: 'Agent' object has no attribute 'network'

The cause is that the Agent structure differs between CleanRL PPO versions. Some versions define

    self.actor = nn.Sequential(...)

without a shared self.network. The fix is to support both layouts in the evaluation script:

    if hasattr(agent, "network"):
        hidden = agent.network(obs)
        logits = agent.actor(hidden)
    else:
        logits = agent.actor(obs)
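
The info-dict handling behind problem 22.3 can be exercised without GRF installed. The sketch below isolates it as a pure function; `sanitize_info` is a hypothetical name, and whether a given GRF version actually emits an `episode` key is an assumption carried over from the notes above.

```python
def sanitize_info(info):
    """Return a copy of `info` that is safe to pass through RecordEpisodeStatistics.

    None becomes an empty dict, and a pre-existing "episode" entry is moved to
    "grf_episode" so the statistics wrapper can write its own "episode" key.
    """
    clean = {} if info is None else dict(info)   # copy: never mutate the env's dict
    if "episode" in clean:
        clean["grf_episode"] = clean.pop("episode")
    return clean


raw = {"score_reward": 1, "episode": {"r": 2.0}}
print(sanitize_info(raw))
print(sanitize_info(None))
```

Working on a copy keeps the dictionary owned by the underlying environment untouched, which matters if the environment reuses that object between steps.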
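
23. Roadmap for further single-agent scenarios

After academy_empty_goal_close is done, continue in this order:

    academy_empty_goal_close
    academy_empty_goal
    academy_run_to_score
    academy_run_to_score_with_keeper
    academy_pass_and_shoot_with_keeper

Example training command:

    mkdir -p logs

    CUDA_VISIBLE_DEVICES=0 nohup python -u cleanrl/ppo_grf.py \
      --env-id academy_empty_goal \
      --total-timesteps 2000000 \
      --num-envs 8 \
      --num-steps 128 \
      --learning-rate 2.5e-4 \
      --seed 1 \
      > logs/grf_empty_goal_seed1_2M.log 2>&1 &

The boundary of this environment should stay clear:

- CleanRL is used for single-agent GRF Academy tasks
- Multi-agent training gets a separate environment
- Multi-agent algorithms use EPyMARL, the official MAPPO implementation, or another MARL framework

24. Final state

After completing this procedure, the environment should be able to:

    1. Create GRF Academy environments
    2. Train a single-agent policy with CleanRL PPO
    3. Train the network on CUDA
    4. Show training curves in TensorBoard
    5. Save the PPO model
    6. Load the model for evaluation
    7. Render the policy while it plays
    8. Save videos and dump files

This completes the Google Research Football + CleanRL single-agent training environment setup.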
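
The scenario order from section 23 can be turned into ready-to-paste training commands. The sketch below is one possible helper, not part of the original workflow: `SCENARIOS` and `build_command` are hypothetical names, the flags mirror the training commands shown earlier, and the timestep budget is left as a parameter because only the first two scenarios have budgets stated in these notes.

```python
import shlex

# Scenario order taken from the roadmap in section 23.
SCENARIOS = [
    "academy_empty_goal_close",
    "academy_empty_goal",
    "academy_run_to_score",
    "academy_run_to_score_with_keeper",
    "academy_pass_and_shoot_with_keeper",
]


def build_command(env_id, total_timesteps, seed=1, num_envs=8,
                  num_steps=128, learning_rate="2.5e-4"):
    """Build the ppo_grf.py command line for one scenario as a shell-safe string."""
    argv = [
        "python", "-u", "cleanrl/ppo_grf.py",
        "--env-id", env_id,
        "--total-timesteps", str(total_timesteps),
        "--num-envs", str(num_envs),
        "--num-steps", str(num_steps),
        "--learning-rate", learning_rate,
        "--seed", str(seed),
    ]
    return shlex.join(argv)  # shlex.join requires Python 3.8+


# Budgets for the two scenarios that appear in this document.
print(build_command(SCENARIOS[0], 1_000_000))
print(build_command(SCENARIOS[1], 2_000_000))
```

Each printed line can be prefixed with `CUDA_VISIBLE_DEVICES=0 nohup` and redirected to a log file in the same way as the commands above.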