If you’re building GenAI applications, creating high-quality evaluations is one of the most impactful things you can do. LLMs only reach their full potential when they consistently produce safe and useful results. Without evals, it’s difficult and time-intensive to understand how different prompts and model versions might affect your use case.
The "LLM-as-a-Judge" framework is a popular approach for automatically grading AI outputs using a separate AI evaluator. Our focus as a company is to train a frontier model to be a general-purpose evaluator – one that works regardless of the vertical (from legal to medical to financial) or use case (be it hallucinations, coherence, or tone).
Currently, most developers who run model-based evaluations – using one model to judge another – do so by prompting proprietary models like GPT-4o or Claude 3.5 Sonnet. We aim to outperform these models with one specifically designed for evaluation.
So how do you train AI to evaluate itself – and exceed the performance of the leading proprietary models?
While it’s largely settled that preference optimization outperforms supervised fine-tuning (SFT) in alignment tasks for most general purpose models, the same can’t be said for evaluation models. Training an LLM as a judge is a complex piece of work, and the choice between training objectives is an open research question – one that we set out to answer.
Training objectives
First, a quick review of the three training objectives we’re looking at.
Supervised fine-tuning (SFT) involves customizing a pre-trained model for specific tasks using a smaller set of labeled data. The goal is to adjust the model’s outputs to match high-quality responses from the training set.
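In practice, the SFT objective reduces to token-level cross-entropy on the reference responses. Here is a minimal PyTorch sketch – the tensor names and masking scheme are illustrative, not our training code:

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, target_ids, response_mask):
    """Cross-entropy over the response tokens only.

    logits:        [batch, seq_len, vocab] model outputs
    target_ids:    [batch, seq_len] prompt + reference response token ids
    response_mask: [batch, seq_len] 1 where the token belongs to the response
    """
    # Shift so that position t predicts token t + 1
    logits = logits[:, :-1, :]
    labels = target_ids[:, 1:]
    mask = response_mask[:, 1:].float()

    token_loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        reduction="none",
    ).view(labels.shape)

    # Average only over response tokens, ignoring the prompt
    return (token_loss * mask).sum() / mask.sum().clamp(min=1)
```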
Direct preference optimization (DPO) trains models using pairwise data, where each pair is composed of a preferred (win) and rejected (lose) sample for the same prompt – i.e. which output is better, A or B? So instead of being trained on examples of “correct” outputs (as with SFT), DPO trains the model to understand human preferences between different options. This directly aligns the model with human preference without the need for a separate reward model.
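As a reference point, here is a minimal sketch of the standard DPO loss from the original DPO paper, written in PyTorch. Each input is a per-example summed log-probability of a response under the policy or a frozen reference model; the names and the value of beta are illustrative:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO objective on a batch of preference pairs.

    Each argument is a [batch] tensor of summed log-probabilities
    log pi(y | x) for the chosen (win) or rejected (lose) response.
    """
    # Implicit rewards: log-ratio of policy vs. reference per response
    chosen_rewards = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_rewards = beta * (policy_rejected_logp - ref_rejected_logp)

    # Negative log-sigmoid of the margin: a larger gap between the
    # chosen and rejected implicit rewards gives a lower loss
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```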
Relative preference optimization (RPO), rather than learning only from pairwise comparisons, incorporates a wider array of preference data, including non-paired samples (similar to the data one might use for SFT). By integrating preference data from prompts that are non-identical but semantically related, RPO improves the model’s ability to generalize across tasks and better captures the complexity of human preferences.
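The published RPO objective is more involved than we can show here (it weights contrasts between responses drawn from semantically related prompts), but a simplified view – consistent with the “compound loss” framing used later in this post – combines an SFT-style likelihood term on chosen responses with a DPO-style preference term. The sketch below shows only that simplification; the `alpha` weighting is an assumption for illustration:

```python
import torch.nn.functional as F

def compound_preference_loss(policy_chosen_logp, policy_rejected_logp,
                             ref_chosen_logp, ref_rejected_logp,
                             chosen_nll, beta=0.1, alpha=1.0):
    """Simplified compound objective: a DPO-style preference term plus
    an SFT-style negative log-likelihood term on the chosen responses.

    chosen_nll: [batch] per-example NLL of the chosen response under
                the policy (i.e. the SFT loss for that example).
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    preference_term = -F.logsigmoid(margin).mean()

    # alpha trades off imitating chosen responses against separating
    # chosen from rejected responses
    return preference_term + alpha * chosen_nll.mean()
```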
Experiment one: Pure SFT
In the first experiment, we trained our own 70-billion-parameter model, Atla Caprioska, using pure SFT, and compared its performance with Llama-3.1-70B-Instruct on core benchmarks. The goal was to see how SFT alone would fare in training a general-purpose AI evaluator.
While our model showed improvements on in-distribution tasks – where the data fits the patterns the model was trained on – quality dropped on out-of-distribution tasks, and it underperformed Llama-3.1-70B-Instruct on aggregate metrics.
Beyond the numbers, we also saw overfitting when using the model for our blinded critique analysis, where we compare critiques generated by our model with critiques from the base model side-by-side.
Here’s how the models compared across different critique tasks:
In five out of six cases, we preferred the base model’s output, while our model’s critiques looked very much like the data it was trained on. Our model was overfitting to the training data, harming both score metrics and critiques, and underperforming the base model.
In short, while pure SFT can improve a model’s performance on the datasets it’s been trained on, it’s prone to overfit to that data, limiting its ability to generalize – less than ideal for a general-purpose AI evaluator.
Experiment two: SFT vs DPO vs RPO
Next, we investigated the effectiveness of the different training objectives – SFT, DPO, and RPO. Inspired by the approach of the Salesforce AI Research team, we wanted to determine whether preference optimization techniques like DPO and RPO would make better training objectives than SFT alone for LLM-as-a-judge models.
To do this, we trained three models on two training datasets: the first using only SFT, the second using only DPO, and the third using RPO, whose compound loss objective essentially incorporates both SFT and DPO. We then compared their performance against a base model (Llama-3.1-8B-Instruct) across four core benchmarks.
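To make the comparison concrete, here is a rough sketch of how pairwise judge accuracy is typically measured on preference benchmarks like RewardBench: the judge sees a prompt and two responses, and its verdict is checked against the human preference label. The `judge` callable, the prompt template, and the record fields are assumptions for illustration, not the exact benchmark harness:

```python
JUDGE_TEMPLATE = """You are evaluating two responses to the same instruction.

Instruction:
{instruction}

Response A:
{response_a}

Response B:
{response_b}

Which response is better? Answer with exactly "A" or "B"."""

def pairwise_accuracy(judge, records):
    """Fraction of preference pairs where the judge picks the
    human-preferred response.

    judge:   callable mapping a prompt string to the model's reply
    records: iterable of dicts with keys 'instruction', 'chosen', 'rejected'
    """
    correct, total = 0, 0
    for rec in records:
        prompt = JUDGE_TEMPLATE.format(
            instruction=rec["instruction"],
            response_a=rec["chosen"],
            response_b=rec["rejected"],
        )
        verdict = judge(prompt).strip().upper()
        # The chosen response is always shown as "A" here; in practice,
        # positions should be randomized to control for position bias.
        correct += int(verdict.startswith("A"))
        total += 1
    return correct / max(total, 1)
```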
RPO showed the most promise, outperforming SFT and DPO across multiple metrics.
Here’s a summary of the key findings:
- DPO performed best on PreferenceCollection with 98.89% accuracy.
- RPO performed best on RewardBench with 81.96% accuracy.
- RPO outperformed both SFT and DPO on UltraFeedback (No CoT), with a score of 0.57.
- RPO achieved the highest average Pearson correlation on evaluation scores (0.49), compared to SFT (0.43) and DPO (0.43).
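For context on that last metric: the Pearson correlation measures how closely the judge’s numeric scores track the reference scores on direct-scoring tasks. A minimal computation with SciPy, using made-up scores purely for illustration:

```python
from scipy.stats import pearsonr

# Made-up scores purely for illustration: judge-assigned vs. reference
judge_scores = [4, 2, 5, 3, 1, 4]
reference_scores = [5, 2, 4, 3, 1, 5]

r, p_value = pearsonr(judge_scores, reference_scores)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```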
Not only was RPO the most robust in terms of generalization, but it also delivered the most reliable and accurate evaluations across a range of tasks. While RPO marginally underperformed the base model on the out-of-distribution BiGGen-Bench benchmark, it surpassed pure SFT. And we can tolerate getting a little worse in one direction if it means getting much better in another!
RPO > DPO > SFT?
Our experiments so far suggest that preference optimization outperforms SFT alone, and more specifically, that RPO outperforms DPO for training evaluator models. Why?
This paper on preference optimization suggests it’s down to RPO widening the gap between the probabilities of desirable and undesirable responses, making the good ones more likely and the bad ones less likely. SFT, by contrast, indiscriminately drives up the probabilities of the answers it sees during training – which may or may not be useful. We saw a similar trend in our own results.
As you train with SFT, the likelihood of both accurate and inaccurate judgements increases.
RPO, by contrast, increases the likelihood of good judgements and decreases the likelihood of bad ones.
That’s why we saw overfitting in our first experiment with pure SFT. The key difference between RPO and DPO lies in the magnitude of this separation – while both techniques improve the distinction between chosen and rejected sequences, RPO creates a larger gap, making it more effective at reinforcing desirable outputs.
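One way to observe this effect directly is to track, on a held-out preference set, the average log-probability the model assigns to chosen versus rejected judgements as training progresses. A minimal sketch, assuming a `sequence_logprob` helper (not shown) that returns the summed log-probability of a response given its prompt:

```python
import torch

@torch.no_grad()
def logprob_separation(model, heldout_pairs, sequence_logprob):
    """Track how chosen vs. rejected judgement likelihoods move in training.

    heldout_pairs:    iterable of (prompt, chosen_judgement, rejected_judgement)
    sequence_logprob: assumed helper returning the summed log-probability of a
                      response given its prompt under `model`
    """
    chosen, rejected = [], []
    for prompt, good, bad in heldout_pairs:
        chosen.append(sequence_logprob(model, prompt, good))
        rejected.append(sequence_logprob(model, prompt, bad))

    chosen_mean = sum(chosen) / len(chosen)
    rejected_mean = sum(rejected) / len(rejected)
    # Under SFT both means tend to rise together; under DPO, and especially
    # RPO, the gap (chosen_mean - rejected_mean) widens over training.
    return chosen_mean, rejected_mean, chosen_mean - rejected_mean
```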
Next, we need to validate these results at a larger scale. We’ll evaluate these models on the rest of the benchmarks from the Salesforce AI Research paper, and run a similar experiment with a 70B base model to see if the trends hold for larger models. We also plan to bring our datasets into the Salesforce Research format (critique + judgment, judgment, and response reconstruction) one by one, to see whether there are additional performance gains to be had there.
The next frontier
To align AI systems that are increasingly complex, we’ll need better tools. Automated evaluators will be crucial in bridging the gap between machine capabilities and human expectations. ‘LLM-as-a-judge’ evaluators are powerful instruments for assessing generative AI systems – but they pose new challenges in training and alignment. Our results suggest that applying RPO to instruction-tuned LLMs is a promising way to help humans evaluate LLM-generated outputs. We are now scaling that work and putting it into practice.