Reproduce DeepSeek R1 'aha moment' using GRPO for Countdown Game with DeepSpeed and vLLM
AI Impact Summary
The post documents reproducing the DeepSeek-R1 'aha moment' by applying Group Relative Policy Optimization (GRPO) to an open model (Qwen2.5-3B-Instruct) on the Countdown game task. It describes a full RL training stack (TRL's GRPOTrainer with Transformers) orchestrated with DeepSpeed for distributed training and vLLM for accelerated generation, run on a multi-GPU node (4x H100). The approach uses two reward functions (a format reward and an accuracy reward) and yields observable progress logs and completed solutions, providing a concrete blueprint for researchers exploring self-verification and long-horizon reasoning in open LLMs, albeit with substantial compute requirements and setup complexity.
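To make the two-reward setup concrete, here is a minimal sketch of what a format reward and an accuracy reward for the Countdown game might look like. This is an illustration, not the post's actual code: the `<think>`/`<answer>` tag layout, function names, and signatures are assumptions based on common R1-style GRPO recipes (TRL's GRPOTrainer accepts arbitrary Python callables as reward functions).

```python
import re


def format_reward(completion: str) -> float:
    """Reward 1.0 if the completion follows the assumed
    <think>...</think><answer>...</answer> layout, else 0.0."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, completion.strip(), re.DOTALL) else 0.0


def accuracy_reward(completion: str, numbers: list[int], target: int) -> float:
    """Reward 1.0 if the equation in <answer> uses exactly the given
    numbers and evaluates to the target, else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    equation = match.group(1).strip()
    # Whitelist digits and basic arithmetic before evaluating (sketch only;
    # eval is acceptable here because the character set is restricted).
    if not re.fullmatch(r"[\d+\-*/(). ]+", equation):
        return 0.0
    used = [int(n) for n in re.findall(r"\d+", equation)]
    if sorted(used) != sorted(numbers):
        return 0.0
    try:
        return 1.0 if abs(eval(equation) - target) < 1e-6 else 0.0
    except (SyntaxError, ZeroDivisionError):
        return 0.0


completion = "<think>55 minus 36 gives 19.</think><answer>55 - 36</answer>"
print(format_reward(completion))                    # 1.0
print(accuracy_reward(completion, [55, 36], 19))    # 1.0
```

A binary format reward like this nudges the model toward emitting an explicit reasoning trace before its answer, while the accuracy reward is only granted for verifiably correct equations; GRPO then ranks sampled completions within each group by their combined reward.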
Affected Systems
- Date: not specified
- Change type: capability
- Severity: medium