Reproduce DeepSeek R1 'aha moment' using GRPO for Countdown Game with DeepSpeed and vLLM
AI Impact Summary
The post documents reproducing the DeepSeek-R1 'aha moment' by applying Group Relative Policy Optimization (GRPO) to an open model (Qwen2.5-3B-Instruct) on the Countdown game task. It describes a full RL training stack (TRL's GRPOTrainer with Transformers) orchestrated with DeepSpeed for distributed training and vLLM for accelerated generation, run on a multi-GPU node (4x H100). The approach uses two reward functions (a format reward and an accuracy reward) and yields observable progress logs and completed solutions, providing a concrete blueprint for researchers exploring self-verification and long-horizon reasoning in open LLMs, albeit with substantial compute requirements and setup complexity.
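To make the two-reward setup concrete, here is a minimal sketch of what a format reward and an accuracy reward for the Countdown game might look like. This is an illustration, not the post's actual code: the `<think>`/`<answer>` tag layout, function names, and signatures are assumptions based on common R1-style GRPO recipes (TRL's GRPOTrainer accepts arbitrary Python callables as reward functions).

```python
import re


def format_reward(completion: str) -> float:
    """Reward 1.0 if the completion follows the assumed
    <think>...</think><answer>...</answer> layout, else 0.0."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, completion.strip(), re.DOTALL) else 0.0


def accuracy_reward(completion: str, numbers: list[int], target: int) -> float:
    """Reward 1.0 if the equation in <answer> uses exactly the given
    numbers and evaluates to the target, else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    equation = match.group(1).strip()
    # Whitelist digits and basic arithmetic before evaluating (sketch only;
    # eval is acceptable here because the character set is restricted).
    if not re.fullmatch(r"[\d+\-*/(). ]+", equation):
        return 0.0
    used = [int(n) for n in re.findall(r"\d+", equation)]
    if sorted(used) != sorted(numbers):
        return 0.0
    try:
        return 1.0 if abs(eval(equation) - target) < 1e-6 else 0.0
    except (SyntaxError, ZeroDivisionError):
        return 0.0


completion = "<think>55 minus 36 gives 19.</think><answer>55 - 36</answer>"
print(format_reward(completion))                    # 1.0
print(accuracy_reward(completion, [55, 36], 19))    # 1.0
```

A binary format reward like this nudges the model toward emitting an explicit reasoning trace before its answer, while the accuracy reward is only granted for verifiably correct equations; GRPO then ranks sampled completions within each group by their combined reward.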
Affected Systems
- Date: not specified
- Change type: capability
- Severity: medium