Hugging Face: Reproducing DeepSeek-R1 ‘aha moment’ with GRPO RL on Qwen2.5-3B-Instruct using DeepSpeed and vLLM | SignalBreak