Scaling laws for reward model overoptimization
AI Impact Summary
This CAPABILITY update formalizes scaling laws for reward-model overoptimization, most directly affecting RLHF-style pipelines in which a learned reward model stands in for the true objective during policy optimization. As optimization pressure against the proxy reward grows, true performance typically rises and then falls (Goodhart's law in action), and the marginal value of additional reward-model capacity and preference data shifts with scale. Teams should expect new guidance on the interaction between model capacity, dataset size, and optimization strength, and should plan experiments to identify the operating point beyond which further optimization yields diminishing returns or outright reward gaming.
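If this entry refers to the functional forms reported in Gao, Schulman, and Hilton (2022), whose title it matches, the gold (true) reward after RL against a proxy reward model follows R(d) = d(alpha - beta * log d), where d = sqrt(KL(policy || initial policy)) and alpha, beta are coefficients fit per reward-model size. The sketch below assumes that form; the coefficient values are illustrative placeholders, not fitted results.

```python
import numpy as np

def gold_reward_rl(d, alpha, beta):
    """Predicted gold (true) reward after RL against a proxy reward model.

    Assumed functional form from Gao et al. (2022): R(d) = d * (alpha - beta * log d),
    where d = sqrt(KL(policy || initial policy)). alpha and beta are fit per
    reward-model size in the paper; here they are free parameters.
    """
    return d * (alpha - beta * np.log(d))

def optimal_kl_distance(alpha, beta):
    """Analytic maximizer of R(d): dR/dd = alpha - beta * (log d + 1) = 0."""
    return np.exp(alpha / beta - 1.0)

# Illustrative (made-up) coefficients, not values from the paper.
alpha, beta = 2.0, 0.5
d_star = optimal_kl_distance(alpha, beta)
print(f"optimal d = sqrt(KL): {d_star:.2f}  (KL budget: {d_star**2:.2f} nats)")
print(f"peak predicted gold reward: {gold_reward_rl(d_star, alpha, beta):.2f}")
```

The closed-form maximizer d* = exp(alpha/beta - 1) gives a candidate KL budget to test empirically, not a value to adopt blindly; past d*, the model predicts that further optimization lowers true reward even as proxy reward keeps climbing.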
Business Impact
Organizations that rely on reward-model-based optimization will need to adjust training budgets and monitoring so they can detect when further optimization against the proxy reward begins to erode true performance, avoiding both diminishing returns and unstable or degraded alignment.
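One lightweight way to operationalize that monitoring, sketched below under the assumption that periodic "gold" evaluations (held-out human labels or a larger trusted evaluator) are logged alongside the proxy reward; all names here are hypothetical:

```python
def should_stop(proxy_scores, gold_scores, tolerance=0.1):
    """Flag likely overoptimization: proxy reward is still rising while gold
    reward has dropped more than `tolerance` below its best observed value.

    proxy_scores / gold_scores: per-checkpoint mean rewards, oldest first,
    assumed to be sampled at the same checkpoints.
    """
    if len(proxy_scores) < 2 or len(gold_scores) < 2:
        return False
    proxy_rising = proxy_scores[-1] >= proxy_scores[-2]
    gold_regressed = max(gold_scores) - gold_scores[-1] > tolerance
    return proxy_rising and gold_regressed

# Hypothetical usage inside a training loop:
# if should_stop(proxy_history, gold_history):
#     checkpoint_and_halt()  # or tighten the KL constraint and continue
```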
Source text
- Date: not specified
- Change type: capability
- Severity: medium