Recreate the DeepSeek R1 "aha moment" with GRPO on Qwen2.5-3B
AI Impact Summary
This recreation of DeepSeek R1's "aha moment" uses GRPO and the Countdown Game to show how reinforcement learning can surface reasoning abilities, such as self-verification and search, that a large language model was not explicitly trained for. The experiment trains Qwen2.5-3B-Instruct on 4x H100 GPUs, using DeepSpeed and vLLM for distributed training. Rule-based reward functions, inspired by the DeepSeekMath paper, offer a practical way to guide the model toward this behavior.
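As a rough illustration of what rule-based rewards for the Countdown Game could look like, here is a minimal sketch in Python. The tag layout (`<think>` then `<answer>`), the function names `format_reward` and `equation_reward`, and the 0/1 reward values are assumptions made for this example and are not taken from the experiment itself.

```python
import re

# Assumed output layout (hypothetical): reasoning in <think>, final equation in <answer>.
FORMAT_RE = re.compile(
    r"^<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)
# Only digits, arithmetic operators, parentheses, dots and spaces may reach eval().
ALLOWED_CHARS_RE = re.compile(r"^[\d+\-*/(). ]+$")


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion follows the assumed tag layout, else 0.0."""
    return 1.0 if FORMAT_RE.match(completion.strip()) else 0.0


def equation_reward(completion: str, numbers: list[int], target: int) -> float:
    """Return 1.0 if the answer uses each given number exactly once and hits the target."""
    match = FORMAT_RE.match(completion.strip())
    if match is None:
        return 0.0
    equation = match.group(1).strip()
    if not ALLOWED_CHARS_RE.match(equation):
        return 0.0
    # Every provided number must appear exactly once in the equation.
    used = [int(n) for n in re.findall(r"\d+", equation)]
    if sorted(used) != sorted(numbers):
        return 0.0
    try:
        # The character filter above restricts the input to plain arithmetic.
        result = eval(equation, {"__builtins__": None}, {})
        return 1.0 if abs(result - target) < 1e-5 else 0.0
    except Exception:
        # Malformed expressions, division by zero, and similar failures earn no reward.
        return 0.0
```

For a puzzle with numbers `[19, 36, 55, 7]` and target `65`, a completion ending in `<answer>55 + 36 - 7 - 19</answer>` would score 1.0 on both checks under these assumptions. Keeping the format check and the correctness check as separate functions lets a trainer weight or log them independently.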
Affected Systems
- Date: not specified
- Change type: capability
- Severity: medium