Fine-tuned Open LLM Judges Outperform GPT-5.2 for Evaluation
Action Required
Organizations can significantly reduce the cost and latency of LLM evaluation by adopting fine-tuned open-source judge models, enabling faster iteration and more efficient model selection.
AI Impact Summary
An open-source LLM judge, GPT-OSS 120B, can outperform GPT-5.2 at evaluating model outputs when fine-tuned with Direct Preference Optimization (DPO) on a relatively small dataset of preference pairs. Because the fine-tuned judge is cheaper and faster to run than the proprietary GPT-5.2, it is a viable alternative for scaling evaluation tasks, and it offers more transparent, cost-effective LLM evaluation for organizations seeking to avoid vendor lock-in.
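The source does not include training details, but the DPO objective it refers to can be sketched per preference pair: the loss rewards the policy for preferring the chosen response over the rejected one by a wider log-probability margin than a frozen reference model does. The function below is an illustrative sketch (names and the example log-probabilities are hypothetical, not from the source):

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    logits = beta * (policy_margin - ref_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# If the policy's preference margin exceeds the reference's, the loss
# drops below log(2), its value when the two margins are equal.
loss = dpo_loss(-10.0, -14.0, -11.0, -12.0)
```

In practice, a library such as Hugging Face TRL's `DPOTrainer` applies this objective over sequence-level log-probabilities; the sketch only shows the scalar loss for one preference pair.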
Affected Systems
- Date: not specified
- Change type: capability
- Severity: high