There is still no surefire way to build a general-purpose LLM evaluator: a system capable of judging the quality of AI outputs across tasks, domains, and metrics. It's a fast-moving field, and researchers are trying different approaches at the model, training-objective, and data levels, but there's no universally accepted recipe.
So we conducted a literature review to understand what's being worked on, the improvements being made, and the remaining challenges. The purpose was two-fold: to refine our own approach to building an evaluator, and to figure out how best we can contribute to the field.
Here’s what we learned, how it’s shaping our approach, and the next steps on our roadmap.
Existing evaluators
Broadly speaking, there are two classes of models that compete in the evaluation space: reward models (RMs) and generative models, the latter typically known as LLM-as-a-judge (LLMJ). Each has distinct architectures, outputs, and training approaches. But in the last couple of years the research directions for both of these model types have converged.
For example, you can train a general-purpose LLM to behave more like a reward model with a modified training objective that preserves its text-generation capabilities, train a higher-performance reward model by regularizing its reward loss with an auxiliary text loss, or train an RM on model-generated critiques (i.e., output by an LLMJ) with moderate success.
Projecting this trend forward, we should end up with the best of both worlds: a model that combines numeric reward outputs with natural-language critiques.
Yet, despite this convergence, there’s no canonical way to get a state-of-the-art general purpose evaluator. Many of the top models share similar approaches, sometimes even training on the same datasets, but each adds some proposed improvement.
Building a better LLM evaluator
These improvements cluster around four areas: data quality and scale, data provenance, data format, and choice of base model.
Data quality and scale. Data is perhaps the most important dimension along which these models can be improved, but there are two distinct schools of thought: maximalist and minimalist. Maximalists build better models by aggressively scaling up their training data; they care about quality, but quantity is the primary objective. Minimalists prioritise data quality at the expense of the number of data points, most commonly by filtering their data down to the highest-quality examples.
Both strategies have had varying degrees of success. CompassJudger-1, a generative model trained on 900k data points, achieved only 85.4% on RewardBench. But SFR-Judge, another LLMJ, trained on 680k data points, was state-of-the-art at the time of release, with 92.7%. In the minimalist camp, SkyworkReward achieved state-of-the-art with a reward model trained on just 80k data points, scoring 94.3% on RewardBench, while Flow Judge, a small LLMJ trained on just 7,500 samples, didn't achieve state-of-the-art on any benchmark. This wide variance suggests there's more than one way to train a good model, and that other variables also account for performance.
That said, there's a good reason to expect the maximalist approach to work: LLMs are essentially a success story of scale. Their architectures scale efficiently through repeatable blocks, GPU clusters have scaled to thousands of interconnected devices, and training datasets have grown to tens of trillions of tokens in pre-training and millions of high-quality datapoints in post-training (the stage our own work targets). Given how well the scaling hypothesis has worked so far, we expect it to continue to hold.
Data provenance (on- vs off-policy). Training a model on data it generated itself (on-policy), rather than on data generated by other models (off-policy), often leads to better alignment and corrects model-specific errors. Iterative approaches, like Nemotron-4-340B-Reward, which started with high-quality initial data and then generated more training data from the model itself, showcase the potential of self-generated datasets. Despite recent progress, it's still unclear whether the on- vs off-policy decision is key to state-of-the-art performance.
Data format. Existing models use a combination of data formats, like pairwise comparison, direct scoring, and classification, but there's no universally accepted format. Typically, reward models are trained on "object-level" interactions, i.e. single-turn or multi-turn conversations between a user and another language model. LLM judges are often trained using supervised fine-tuning on data that contains those interactions as well as judgments (in the form of scores, preferences, or binary verdicts) and critiques, which are natural-language explanations of the judgments. The order of the critiques and scores is important. Critique-first formats generally yield better-calibrated scores by laying out detailed reasoning before assigning a verdict. The opposite order, where the score comes first, leads to rationalizations: critiques that argue for a score already assigned.
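To make this concrete, here's a minimal sketch of what a single critique-first training record could look like. The field names and rubric are our own illustrative choices, not taken from any particular dataset.

```python
# Hypothetical critique-first training record for an LLM judge.
record = {
    "conversation": [
        {"role": "user", "content": "Summarise the attached meeting notes in three bullet points."},
        {"role": "assistant", "content": "..."},  # the object-level response being judged
    ],
    "rubric": "Score 1-5 for faithfulness to the source and coverage of key decisions.",
    # The critique comes first, so the verdict is conditioned on explicit reasoning...
    "critique": "The summary covers two of the three decisions but invents a deadline "
                "that does not appear in the notes.",
    # ...and the score is emitted last, after the reasoning.
    "score": 2,
}
```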
Choice of base model. Both the model architecture (e.g., Llama, Mistral, Gemma) and size (e.g., 8B vs 70B parameters) significantly affect performance. Larger models generally outperform smaller models and have better promptability, but it’s not always clear whether larger models are worth the increased costs associated with training and serving when it comes to LLM evaluators.
For general-purpose LLMs, 70B models tend to be much better than 8B models for the majority of use cases. For judges, this doesn't seem to be the case: the improvement from scaling from 8B to 70B can be in the single-digit percentages, while costs can increase ten-fold. There are also 8B models with specific algorithmic improvements that outperform many 70B models on targeted evaluation tasks. So the choice of model also comes down to a user's constraints (e.g. the need to self-host models for privacy) and specific use case.
Comparison of existing LLM evaluator models
The following table breaks down some of the best evaluators today. The models included in the table are selected either for performance on RewardBench, for novelty of the technical approach, or for their progress in productising LLM-as-a-judge.
Training methods
Typically, LLMJs use Supervised Fine-Tuning (SFT) as the training objective, and reward models use the Bradley-Terry model. To improve on "plain" SFT, researchers have started using Direct Preference Optimization (DPO), Relative Preference Optimization (RPO), and Odds Ratio Preference Optimization (ORPO), all of which streamline RLHF by removing components such as the reference model (DPO still uses one by default, though a reference-free variant exists) or by combining steps. Proximal Policy Optimization (PPO), a form of RLHF, has also been used to improve a general-purpose model by training it against a reward model, though we haven't seen it used to train an LLMJ yet.
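For reference, the Bradley-Terry objective used by most pairwise reward models boils down to a logistic loss over the score margin between the chosen and rejected responses. A minimal PyTorch sketch, assuming the model's reward head has already produced scalar scores:

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Maximise the log-probability that the chosen response outranks the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```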
Yet, while selecting the right training objective is important, it's less important than the quality of the data. So, for now, we're prioritising simpler objectives, specifically RPO loss (roughly DPO loss plus an SFT loss), to keep memory overhead low. More complex methods can be revisited once data improvements plateau.
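As a rough sketch of what we mean by "DPO loss plus SFT loss", here's a simplified PyTorch version. It glosses over details of any published RPO formulation, and the beta and alpha weights are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def rpo_like_loss(
    policy_logp_chosen: torch.Tensor,    # summed token log-probs of the chosen critique under the policy
    policy_logp_rejected: torch.Tensor,  # same for the rejected critique
    ref_logp_chosen: torch.Tensor,       # log-probs under the frozen reference model
    ref_logp_rejected: torch.Tensor,
    beta: float = 0.1,                   # assumed preference-loss temperature
    alpha: float = 1.0,                  # assumed weight on the auxiliary SFT term
) -> torch.Tensor:
    # DPO term: prefer the chosen critique over the rejected one,
    # measured relative to the reference model.
    margin = (policy_logp_chosen - ref_logp_chosen) - (policy_logp_rejected - ref_logp_rejected)
    dpo_term = -F.logsigmoid(beta * margin)
    # Auxiliary SFT term: negative log-likelihood of the chosen critique,
    # which keeps the model anchored to generating good critiques.
    sft_term = -policy_logp_chosen
    return (dpo_term + alpha * sft_term).mean()
```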
Data quality is critical for model performance, with per-aspect annotations (e.g., helpfulness, safety) outperforming overall judgments. And the choice of preferred and rejected responses is more impactful than the quality of the generated content itself. For our use case, both the quality of critiques and the careful selection of chosen and rejected critiques are crucial, as these directly define the learning signal for the model. Fine-grained evaluation criteria and detailed rubrics minimize ambiguity and remove implicit assumptions for annotators (LLM or human), ensuring cleaner training data.
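As a small illustration, a per-aspect annotation attaches separate labels to each example rather than one overall judgment; the aspect names and 1-5 scale here are hypothetical, not drawn from any specific dataset.

```python
# Hypothetical per-aspect annotation for a single prompt/response pair.
example = {
    "prompt": "How do I reset my home router?",
    "response": "...",
    "aspects": {
        "helpfulness": 4,
        "correctness": 5,
        "safety": 5,
        "verbosity": 2,
    },
}
```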
When it comes to volume, scaling up data benefits smaller models more than larger ones, aligning with our observation that performance gains don’t fully carry over from 8B to larger models. Larger models appear inherently stronger out of the box, making further improvements harder to achieve.
Synthetic data generation and usage
Humans have relied upon the assumption that sharing their knowledge with future generations and experimenting with new ideas can lead to more capable descendants. This assumption has largely been borne out. For AI, the question becomes: can a model generate synthetic data that is better than the data it was trained on, and thereby improve itself?
The LLMJ field is new enough that dedicated, domain-specific datasets are sparse or low quality. So generating synthetic data has become a critical technical skill and research area in AI, because it is central to creating the large, high-quality datasets that drive model development and performance gains.
Methodology seems to be key to generating high-quality data, to which model performance is particularly sensitive. It involves three stages: generation, curation, and evaluation.
Generation. Prompt engineering is a key part of guiding LM outputs, but it's unrealistic to expect an LLM to generate the entire desired dataset in a single pass, especially when targeting data with complex structure or semantics. This limitation highlights the need for multi-step generation, a process that breaks data creation down into logical stages. For instance, Prometheus and Flow Judge generate rubric templates for each topic, then multiple user inputs for each rubric, and then assistant responses for each possible score in a direct scoring task. These models achieved near-state-of-the-art performance at the time.
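Here's a minimal sketch of that kind of multi-step pipeline, assuming a generic generate(prompt) helper that wraps whichever LLM you use; the prompts and counts are illustrative, not the ones used by Prometheus or Flow Judge.

```python
def generate(prompt: str) -> str:
    """Placeholder for a call to your LLM of choice (API or local model)."""
    raise NotImplementedError

def build_direct_scoring_dataset(topics, inputs_per_rubric=3, scores=range(1, 6)):
    dataset = []
    for topic in topics:
        # Step 1: a scoring rubric for the topic.
        rubric = generate(f"Write a 1-5 scoring rubric for evaluating responses about {topic}.")
        for _ in range(inputs_per_rubric):
            # Step 2: a user input that the rubric applies to.
            user_input = generate(f"Write a realistic user request about {topic}.")
            # Step 3: one assistant response per target score.
            for score in scores:
                response = generate(
                    f"Rubric:\n{rubric}\n\nUser request:\n{user_input}\n\n"
                    f"Write an assistant response that deserves a score of {score}."
                )
                dataset.append(
                    {"rubric": rubric, "input": user_input, "response": response, "score": score}
                )
    return dataset
```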
Curation. A less commonly used strategy is re-weighting examples based on RMs: amplifying the influence of high-quality examples while down-weighting less relevant data. For example, to create preference pairs, Magpie generated n responses for each prompt and scored each one with an RM, using the lowest-scoring response as the rejected one and the highest-scoring as the chosen one, and discarding the rest. Another innovative strategy is to annotate a small subset of examples with a large, high-quality RM and then train a lightweight classifier on those outputs. In Textbooks Are All You Need, this method was used to efficiently extend high-quality annotations across a broader dataset by training a random forest classifier on the RM-labeled subset. Although not specific to the generative-judge use case, the method proposed in this paper achieved some of the best performance relative to model size.
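A sketch of the best-vs-worst selection described above, assuming a reward_model(prompt, response) function that returns a scalar score; the names are placeholders rather than Magpie's actual code.

```python
def build_preference_pair(prompt: str, responses: list[str], reward_model) -> dict:
    """Score n candidate responses with a reward model and keep only the extremes."""
    scored = sorted(responses, key=lambda r: reward_model(prompt, r))
    return {
        "prompt": prompt,
        "rejected": scored[0],   # lowest-scoring response
        "chosen": scored[-1],    # highest-scoring response
        # everything in between is discarded
    }
```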
Evaluation. This overlaps with curation, but evaluation ensures that synthetic datasets align with task requirements and maintain both quality and diversity. Classical approaches, like n-gram overlap, BLEU, and ROUGE, remain interpretable ways of measuring diversity across data points. With the increasing popularity of text embedding models, tools like t-SNE have also become instrumental in visualising dataset diversity by projecting data into lower-dimensional spaces. These techniques reveal patterns, highlight clusters, and identify gaps, ensuring the data sufficiently spans the task's complexity. But automated methods alone cannot catch every inconsistency, partly because of the models' own biases, so human judgment is still necessary: combining automated metrics with human oversight addresses the subtleties that automated methods overlook.
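As an example of an embedding-based diversity check, here's a short sketch using sentence-transformers and scikit-learn's t-SNE; the encoder name and perplexity are arbitrary choices, not a recommendation.

```python
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_dataset_diversity(texts: list[str]) -> None:
    # Embed each synthetic example with an off-the-shelf sentence encoder.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(texts)
    # Project to 2D with t-SNE to eyeball clusters and coverage gaps
    # (perplexity must be smaller than the number of examples).
    coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
    plt.scatter(coords[:, 0], coords[:, 1], s=5)
    plt.title("Synthetic dataset diversity (t-SNE of text embeddings)")
    plt.show()
```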
The literature makes clear that models have been improved using data generated by existing LLMs. So, to answer the initial question: yes, a model can generate synthetic data that is better than the data it was trained on, and thus improve itself.
We’re already using synthetic data to train our judge models.
Benchmarks and evaluation datasets
To build a robust general-purpose LLM evaluator, it’s essential to get a horizontal view of the performance across as many different tasks and scenarios as possible (e.g. discourse, dialogue, information retrieval, and summarisation), and across both general and fine-grained metrics (e.g. accuracy, calibration, robustness, fairness, bias, toxicity, efficiency). Language models are prone to hallucinations, biases, and lack of robustness, so an effective evaluator needs to identify these pitfalls while avoiding them in its own generations.
So a comprehensive benchmarking of an LLM-as-a-Judge needs to include meta-evaluation benchmarks that are multifaceted and allow assessment of performance across various tasks and metrics, including robustness to bias and adversarial attacks.
We group them into general-purpose and domain-specific benchmarks. Within each group there are benchmarks for pairwise data, classification tasks, and absolute scoring tasks, reflecting the different ways a judge LLM can be used.
General-purpose benchmarks evaluate the overall performance of the LLM-as-a-Judge across a variety of tasks. For absolute scoring, Flask, BigGenBench, and MTBench all cover multiple domains. Flask evaluates across 12 domain-specific rubrics, while BigGenBench includes instance-specific rubrics for granular assessments. For pairwise evaluation, RewardBench has gained a lot of popularity due to its open leaderboard. Its purpose is to evaluate a judge's ability to select the better response from a pair of model completions, covering tasks a chatbot would typically handle: Chat (Hard and Easy), Code, Math, and Safety. But RewardBench is becoming saturated. Saturation can result from genuine advancements in model capabilities, but it also reflects an ongoing arms race between model developers and benchmark creators.
Benchmarks like RewardBench often start off “difficult” for a given generation of models (T). Model builders, eager to post strong results, might either directly train their next-generation models (T+1) on the benchmark or indirectly train them using data similar to that of the benchmark. This leads to rapid improvement but also diminishes the benchmark’s ability to differentiate between genuinely advanced models and those optimized specifically for it. As a result, benchmarks like RMBench and JudgeBench have appeared, which are similar to RewardBench in tasks and datasets, but designed to be harder and thus supersede the original benchmark.
Domain-specific benchmarks provide task-focused evaluations that assess how effectively a judge evaluates specific tasks. For example, LLMAggreFact evaluates grounding in evidence across 11 fact-checking datasets, CodeJudgeEval assesses the classification of failure modes in generated code, and HalluDial evaluates dialogue-level hallucinations.
Addressing bias in LLM-as-a-Judge models
The LLMJ paradigm has been shown to be quite susceptible to biases, but specific frameworks and techniques exist to address them.
Position bias, for instance, where the order in which the responses appear influences the judge's decision, can be mitigated by evaluating response pairs with flipped positions and requiring consistent predictions. Length bias, a preference for longer responses, can be mitigated by comparing model preferences for longer responses against human judgments. And some frameworks tackle multiple biases at once: CoBBLEr and CALM test for various cognitive biases, including compassion fade, egocentric bias, and the bandwagon effect, using perturbations in prompts and responses.
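To illustrate the flipped-position check, here's a hedged sketch that assumes a judge(prompt, response_a, response_b) function returning "A" or "B"; the interface is hypothetical.

```python
def consistent_preference(judge, prompt: str, resp_1: str, resp_2: str):
    """Ask the judge twice with the responses in swapped positions;
    accept the verdict only if it is order-invariant."""
    first = judge(prompt, response_a=resp_1, response_b=resp_2)   # "A" or "B"
    second = judge(prompt, response_a=resp_2, response_b=resp_1)  # positions flipped
    # The same underlying response must win both times.
    if first == "A" and second == "B":
        return resp_1
    if first == "B" and second == "A":
        return resp_2
    return None  # inconsistent verdict: treat as a tie or discard the pair
```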
Our strategy is to evaluate our judge in a holistic manner, across domains and for different tasks. This means evaluating across multiple benchmarks: from meta-evaluation standards to bias-specific frameworks.
Launching our models
The lack of a universally accepted recipe for building a general-purpose LLM evaluator is not an obstacle but an opportunity. If training such a model were obvious, we would expect our work to hit diminishing returns quickly. But that's not happening. Instead, through our work, we have a chance to have an outsized impact on the problem.
In this spirit, we're excited to announce two upcoming launches. In the next two weeks, we'll be open-sourcing our state-of-the-art 8B general-purpose evaluation model, alongside a comprehensive technical report. And, soon after, we'll be releasing our largest model as part of our paid product.
These are not just more models to score AI outputs. They are a key piece in the larger puzzle of ensuring AI systems remain aligned with human preferences as they grow more capable. Our aim is to outperform existing evaluators both across benchmarks and real-world applications. Watch this space.