TRL: Introducing MPO, GRPO, and GSPO for Vision Language Model Alignment
AI Impact Summary
This release adds three alignment techniques for Vision Language Models (VLMs) to TRL: Mixed Preference Optimization (MPO), Group Relative Policy Optimization (GRPO), and Group Sequence Policy Optimization (GSPO). They build on the existing SFT and DPO methods. The new methods aim to improve multimodal alignment by extracting richer signals from preference data and by scaling better with modern VLMs, which may yield performance gains over the earlier approaches.
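To make the "group relative" idea in GRPO concrete, the sketch below normalizes the rewards of several completions sampled for one prompt against that group's own mean and standard deviation. It is a minimal illustration in plain Python, not TRL's implementation: the function name `group_relative_advantages`, the `eps` value, the use of the population standard deviation, and the sample rewards are all assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Score each completion relative to the others sampled for the same prompt.

    Hypothetical helper for illustration only; `eps` guards against a zero
    standard deviation when every completion receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four completions of a single prompt.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)

# Completions above the group mean get a positive advantage,
# those below it get a negative one.
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

Because each advantage is measured against the group itself, this formulation does not need a separate learned value model to supply a baseline.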
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info