Understanding R1-Zero-Like Training: A Critical Perspective

Source | HN Comments

该研究批判性地分析了类似R1-Zero的训练方法，重点关注基础模型和强化学习。研究发现，DeepSeek-V3-Base展现了“Aha moment”，Qwen2.5基础模型无需提示模板也能表现出色。在强化学习方面，GRPO存在偏差，提出了Dr. GRPO作为改进方案。模板和问题集共同影响RL动态，不匹配的模板可能破坏推理能力。研究还发现，Llama也能通过RL调整，特定领域的预训练可以提高上限。最终，研究提出了极简的R1-Zero训练方案，并在Qwen2.5-Math-7B上实现了先进性能。

理解类似 R1-Zero 训练：一种批判性视角

跳至内容

sail-sg / understand-r1-zero Public

Notifications You must be signed in to change notification settings
Fork 7
Star 132

License

MIT license

132 stars 7 forks Branches Tags Activity

sail-sg/understand-r1-zero

main

Branches Tags

Folders and files

Name| Name| Last commit message| Last commit date ---|---|---|--- analysis| analysis assets| assets datasets| datasets examples| examples understand_r1_zero| understand_r1_zero .gitignore| .gitignore LICENSE.txt| LICENSE.txt Makefile| Makefile README.md| README.md evaluate_model.py| evaluate_model.py pyproject.toml| pyproject.toml train_zero_math.py| train_zero_math.py understand-r1-zero.pdf| understand-r1-zero.pdf

Latest commit

History

12 Commits

Understanding R1-Zero-Like Training: A Critical Perspective

🎉 Updates • 🔗 Links • 📖 TL;DR

💻 Usage • 🍊 Citation • 🌻 Acknowledgement

Updates

21/03/2025: 🎉 我们发布了我们的论文、模型和代码库。我们的 R1-Zero 训练由 🌾 Oat 实现，Oat 是一个高度模块化、对研究友好且高效的 LLM RL 框架。

TL;DR

为了理解类似 R1-Zero 的训练，我们批判性地考察了两个核心组成部分：基础模型和强化学习。我们在下面重点介绍我们的发现。

On base models:

DeepSeek-V3-Base 已经展现出 "Aha moment"。

作为类似 R1-Zero 训练的流行选择，Qwen2.5 基础模型即使没有提示模板也表现出强大的推理能力：平均基准分数提高了 ~60%（与传统的 4-shot 提示相比）！

On reinforcement learning:

GRPO 导致有偏差的优化！我们提出了一个简单的修复方法，可以在保持推理性能的同时提高 token 效率，称为 Dr. GRPO（GRPO D one R ight）。

在类似 R1-Zero 的训练中，模板和问题集共同影响 RL 动态。
- （左图）对于 Qwen2.5-Math-1.5B，不匹配的模板（例如，R1 模板）实际上会破坏 RL 重构之前的推理能力。这使得改进在表面上令人印象深刻。
- （中图）但是，如果模板没有太偏离预训练分布，即使是一个小的、完全 o.o.d. 问题集（例如，GSM8K）也可以通过加强正确的推理行为而不是注入新知识来同样好地诱导推理能力。

除了 Qwen 之外，Llama 也可以从基础模型进行 RL 调整。在这种情况下，特定领域的预训练将提高 RL 上限。
- （右图）GRPO 甚至可以通过增加输出长度使具有数学知识的 Llama "Aha"；但是，这可能是由于其长度偏差造成的，可以使用 Dr. GRPO 消除。

Our minimalist R1-Zero recipe:

我们的分析表明，类似 R1-Zero 训练的极简配方：

我们使用 MATH 3-5 级问题上的（无偏差）Dr. GRPO 算法和 Qwen-Math 模板对 Qwen2.5-Math-7B 进行 RL 调整，并且仅在 8× A100 GPU 上花费 27 小时的计算时间即可实现最先进的性能。

如果您对更多详细信息感兴趣，请查看我们的 paper！

Usage

Install

我们建议使用干净的 python==3.10 环境进行开发。

# Install vllm & oat, the LLM RL framework we developed r1-zero training on.
pip install vllm==0.7.2 && pip install oat-llm==0.0.9
# Install this package locally to use the math grader.
git clone git@github.com:sail-sg/understand-r1-zero.git && cd understand-r1-zero
pip install -e .

Training

我们通过扩展 Oat 的 Learner 和 Actor 组件来实现 R1-Zero 训练。有关分步指南，请参见 train_zero_math.py。

# Patch LD_LIBRARY_PATH to avoid dependency errors:
export LD_LIBRARY_PATH=$(python -c "import sysconfig; print(sysconfig.get_config_var('LIBDIR'))"):$LD_LIBRARY_PATH
# Run the experiment (tested on 8 x A100-40G) with Dr. GRPO:
# (change to `--critic_type grpo` for running GRPO)
python train_zero_math.py \
  --critic_type drgrpo \
  --gpus 8 \
  --enable_prefix_caching \
  --collocate \
  --vllm_sleep \
  --vllm_gpu_ratio 0.35 \
  --gradient-checkpointing \
  --flash-attn \
  --bf16 \
  --rnd-seed \
  --learning_rate 0.000001 \
  --lr_scheduler constant \
  --num_ppo_epochs 1 \
  --beta 0 \
  --oracle_type reward \
  --oracle math \
  --pretrain Qwen/Qwen2.5-Math-1.5B \
  --prompt_template r1 \
  --zero-stage 2 \
  --ref_offload \
  --prompt_data ./datasets/train/math_12k \
  --train_split train \
  --input_key problem \
  --output_key answer \
  --max-train 9999999 \
  --num_prompt_epoch 20 \
  --prompt_max_length 1024 \
  --num_samples 8 \
  --temperature 1 \
  --top_p 1 \
  --generate_max_length 3000 \
  --save_steps -1 \
  --train_batch_size 128 \
  --rollout_batch_size 128 \
  --rollout_batch_size_per_device 16 \
  --pi_buffer_maxlen_per_device 128 \
  --eval_batch_size 200 \
  --eval_steps 16 \
  --eval_temperature 0 \
  --eval_generate_max_length 3000 \
  --eval_data ./datasets/evaluation_suite \
  --eval_input_key input \
  --use-wb \
  --wb-run-name qwen2.5-Math-1.5b-r1-zero \
  --wb_project oat-zero

有关更多示例脚本，请参见 here。

Evaluation

# Evaluate our models:
python evaluate_model.py --model_name sail/Qwen2.5-Math-7B-Oat-Zero
python evaluate_model.py --model_name sail/Qwen2.5-Math-1.5B-Oat-Zero
python evaluate_model.py --model_name sail/Llama-3.2-3B-Oat-Zero --template r1
# Evaluate baseline models:
python evaluate_model.py --model_name Qwen/Qwen2.5-Math-1.5B
python evaluate_model.py --model_name Qwen/Qwen2.5-Math-7B
python evaluate_model.py --model_name hkust-nlp/Qwen-2.5-Math-7B-SimpleRL-Zero
python evaluate_model.py --model_name PRIME-RL/Eurus-2-7B-PRIME-Zero
python evaluate_model.py --model_name Open-Reasoner-Zero/Open-Reasoner-Zero-7B

Citation

如果您认为我们的工作对您的研究有用，请考虑引用：

@misc{liu2025understanding,
 title={Understanding R1-Zero-Like Training: A Critical Perspective},
 author={Zichen Liu and Changyu Chen and Wenjun Li and Penghui Qi and Tianyu Pang and Chao Du and Wee Sun Lee and Min Lin},
 year={2025},
 howpublished={\url{https://github.com/sail-sg/understand-r1-zero}},
}

Acknowledgement

This work is supported by Sea AI Lab for computing resources.
The training codes are built on Oat, which employs vLLM, DeepSpeed and launchpad.
The base models are from Qwen2.5-Math, Llama, and DeepSeek.
We thank Qingfeng Lan for his time in thoroughly reviewing our code.

About

Understanding R1-Zero-Like Training: A Critical Perspective

Languages

标题: 理解类R1-Zero训练：一种批判性视角

Understanding R1-Zero-Like Training: A Critical Perspective

理解类似 R1-Zero 训练：一种批判性视角

License

sail-sg/understand-r1-zero

Folders and files

Latest commit

History

Understanding R1-Zero-Like Training: A Critical Perspective

Updates

Links

TL;DR

On base models:

On reinforcement learning:

Our minimalist R1-Zero recipe:

Usage

Install

Training

Evaluation

Citation

Acknowledgement

About

Topics

Resources

License

Stars

Watchers

Forks

Languages