Recreate DeepSeek R1 "aha moment" with GRPO on Qwen2.5-3B

AI Impact Summary

This post recreates DeepSeek R1's "aha moment" using GRPO (Group Relative Policy Optimization) and the Countdown Game, showing how reinforcement learning can elicit reasoning behavior in a large language model that was never explicitly taught. The experiment trains Qwen2.5-3B-Instruct with distributed training on 4x H100 GPUs, using DeepSpeed and vLLM, and focuses on self-verification and search abilities. Rule-based reward functions, inspired by the DeepSeekMath paper, score each completion against fixed checks, which gives a practical way to guide learning and reproduce the behavior at small scale.
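To make the idea of rule-based rewards concrete, here is a minimal Python sketch for the Countdown Game (combine a given list of numbers with +, -, *, / to reach a target). It is not the post's actual code. The tag layout (<think>...</think> followed by <answer>...</answer>), the 0/1 reward values, the function names, and the use of the population standard deviation are all assumptions made for illustration. The last function shows the group-relative normalization that GRPO applies to the rewards of completions sampled for the same prompt.

```python
import ast
import operator
import re
from statistics import mean, pstdev

# Assumed completion layout: reasoning in <think>, final equation in <answer>.
# This is a loose check; it does not reject repeated tags.
FORMAT_RE = re.compile(r"<think>.+?</think>\s*<answer>.+?</answer>\s*", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
ALLOWED_RE = re.compile(r"[\d+\-*/() ]+")  # digits, + - * /, parentheses, spaces

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def _evaluate(node):
    """Evaluate a parsed arithmetic expression without calling eval()."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, int):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    raise ValueError("unsupported expression")


def format_reward(completion):
    """1.0 if the completion follows the assumed tag layout, else 0.0."""
    return 1.0 if FORMAT_RE.fullmatch(completion) else 0.0


def equation_reward(completion, numbers, target):
    """1.0 if the answer uses each given number exactly once and hits the target."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0
    equation = match.group(1).strip()
    if not ALLOWED_RE.fullmatch(equation):
        return 0.0
    try:
        used = [int(n) for n in re.findall(r"\d+", equation)]
        if sorted(used) != sorted(numbers):
            return 0.0
        value = _evaluate(ast.parse(equation, mode="eval"))
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0
    return 1.0 if abs(value - target) < 1e-5 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt.

    Assumption: population standard deviation plus a small eps; trainer
    implementations may differ in both choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For a group of completions sampled from one prompt, one would sum format_reward and equation_reward per completion and pass the totals to group_relative_advantages. Completions that score above the group mean get a positive advantage and are reinforced, while a group in which every completion scores the same yields all-zero advantages and therefore no learning signal.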

Affected Systems

DeepSeek R1
Qwen/Qwen2.5-3B-Instruct

Date: not specified
Change type: capability
Severity: medium
