Hugging Face: Reproduce DeepSeek R1 'aha moment' using GRPO for Countdown Game with DeepSpeed and vLLM | SignalBreak