TRL introduces RLOO: REINFORCE Leave-One-Out RLHF Trainer
AI Impact Summary
This update introduces RLOO (REINFORCE Leave-One-Out), a new online RLHF training algorithm in TRL designed to be more accessible and less resource-intensive than PPO. RLOO does not need a separate value model, so it keeps fewer models in memory, reducing GPU memory usage by approximately 50-70% compared to PPO, depending on model size. It also trains faster, up to 3x faster than PPO with 6.9B models. The key idea is to model the entire generation as a single action rather than treating each token as an action, so each completion receives one sequence-level reward, which simplifies the reward calculation and makes reinforcement learning more efficient.
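The leave-one-out idea in RLOO's name can be illustrated with a minimal sketch. It assumes that k completions are sampled for the same prompt, that each completion gets one scalar reward, and that the baseline for a completion is the mean reward of the other k-1 completions. The function names `rloo_advantages` and `rloo_loss` are hypothetical and are not TRL's API; this is plain Python for illustration, not the trainer's implementation.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for k completions of the same prompt.

    rewards: list of k sequence-level rewards (one scalar per completion).
    The baseline for completion i is the mean reward of the other k-1.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out needs at least 2 completions per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def rloo_loss(seq_logprobs, rewards):
    """REINFORCE-style surrogate loss with a leave-one-out baseline.

    seq_logprobs: list of k summed token log-probabilities, one per completion,
    so the whole completion is treated as a single action.
    """
    if len(seq_logprobs) != len(rewards):
        raise ValueError("need one log-probability and one reward per completion")
    advantages = rloo_advantages(rewards)
    k = len(rewards)
    # Minimizing this pushes up the log-probability of above-baseline completions.
    return -sum(a * lp for a, lp in zip(advantages, seq_logprobs)) / k


# Illustrative values only: 4 completions for one prompt.
rewards = [1.0, 2.0, 3.0, 6.0]
seq_logprobs = [-3.0, -2.0, -4.0, -1.0]
advantages = rloo_advantages(rewards)
loss = rloo_loss(seq_logprobs, rewards)
```

Because each baseline is built only from the other samples of the same prompt, no learned value model is needed to estimate it, and the advantages for a prompt always sum to zero.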
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info