Training an LLM-as-a-Judge with Synthetic Data

Andrei
November 25, 2024

The success of any AI model depends on the amount and quality of data it is trained on. This also applies to LLM-as-a-Judge models that are designed to assess other LLMs based on human preferences. But human annotation is often time-consuming, costly, and prone to inconsistencies.

To overcome these limitations, the field is increasingly turning to synthetic data as a scalable alternative. But using AI-generated samples to train LLM evaluators comes with its own set of challenges. So how can you ensure that the synthetic data you use actually improves your model instead of compromising it? Here at Atla, we have spent a lot of time thinking about this in our efforts to build a leading AI evaluator, and this is what we have learned.

Current approaches

The existing frameworks we have come across demonstrate the potential and challenges of using synthetic data to train AI evaluators effectively: 

Prometheus I generates datasets of around 100,000 samples and scores model outputs on a 1-5 Likert scale, a numerical rating commonly used for subjective assessments. While straightforward, this approach struggles to capture nuanced differences between outputs.
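
As a rough illustration of this style of Likert-scale judging (not Prometheus's actual prompts or code), a single-response scorer can be sketched as follows; the prompt template and call_llm helper are hypothetical placeholders for your own rubric and model client.

```python
# Minimal sketch of 1-5 Likert-scale judging. Illustrative only: the prompt
# wording and `call_llm` are hypothetical stand-ins, not Prometheus code.
LIKERT_PROMPT = """You are an evaluator. Score the response on a 1-5 scale.

Rubric: {rubric}
Question: {question}
Response: {response}

Return only a single integer from 1 (poor) to 5 (excellent)."""

def likert_score(question: str, response: str, rubric: str, call_llm) -> int:
    """Ask a judge model for a 1-5 rating of a single response."""
    prompt = LIKERT_PROMPT.format(rubric=rubric, question=question, response=response)
    raw = call_llm(prompt)   # e.g. a call to your hosted or local judge model
    return int(raw.strip())  # real pipelines need more robust output parsing
```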

Prometheus II builds upon Prometheus I by introducing pairwise comparisons, enabling evaluators to better assess the relative quality of responses. It improves performance in subjective evaluations, such as coherence and fluency, but still relies on predefined criteria. 

FlowJudge is designed for domain-specific applications in industries like healthcare, law, and finance. It relies on smaller, curated datasets of around 7,500 samples and combines classification tasks with 1-3 and 1-5 Likert scales. It achieves high precision in target fields but struggles with generalization, making it less suitable for multi-purpose evaluation. 

Another approach, described in the paper Self-Taught Evaluators, begins with unlabeled instructions and generates contrasting outputs to form preference pairs for training. These models iteratively refine their performance across multiple cycles. Unlike earlier frameworks, they do not rely on pre-annotated data, and surpass proprietary models like GPT-4 on benchmarks such as RewardBench and MT-Bench. Despite their promising performance, they can still inherit biases from the models used to generate the synthetic input, underscoring the need for high-quality, representative data. 
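
The data-generation loop behind this kind of self-taught setup can be sketched roughly as follows. This is our paraphrase under assumptions, not the paper's code; call_llm and the degradation prompt are hypothetical stand-ins.

```python
# Rough sketch of a self-taught data loop: take an unlabeled instruction,
# produce a good response and a deliberately degraded one, and store them as a
# (chosen, rejected) preference pair for judge training. `call_llm` is a
# hypothetical stand-in for your model client.

def make_preference_pair(instruction: str, call_llm) -> dict:
    chosen = call_llm(f"Answer the following instruction well:\n{instruction}")
    # Ask for a plausible-but-flawed answer so the pair has a clear contrast.
    rejected = call_llm(
        "Write a response to the instruction below that looks reasonable "
        f"but contains subtle factual or reasoning errors:\n{instruction}"
    )
    return {"prompt": instruction, "chosen": chosen, "rejected": rejected}

def build_preference_dataset(instructions: list[str], call_llm) -> list[dict]:
    return [make_preference_pair(i, call_llm) for i in instructions]
```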

Common challenges

Across these approaches, we have observed several common challenges: 

Data quality: High-quality raw data forms the backbone of effective synthetic data pipelines. Biases in seed data or low-quality sources can compromise the performance of even the most advanced evaluators.

Data diversity: Without sufficient diversity, models risk overfitting to narrow distributions, reducing their adaptability to new tasks and scenarios. 

Consistency: Many frameworks struggle to align scoring with reasoning, leading to inconsistencies that undermine reliability. 

Optimization trade-offs: Techniques like supervised fine-tuning (SFT), direct preference optimization (DPO), relative preference optimization (RPO), and direct judgment preference optimization (DJPO) each have unique strengths and weaknesses (a minimal DPO loss sketch follows this list):

  • SFT supports iterative refinement, but is prone to overfitting.
  • DPO excels at learning pairwise preferences but is less adept at handling ambiguous or context-dependent preferences.
  • RPO integrates pairwise and non-paired data, enabling better generalization across tasks, but incurs higher computational costs and may be less effective for highly specialized domains.
  • DJPO extends DPO by using reasoning-based critiques alongside pairwise preferences. This improves alignment with nuanced human judgments but requires high-quality, diverse datasets to avoid bias or overfitting. Developed by the Salesforce AI Research team, this method has also inspired our approach. 
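
As a concrete anchor for these trade-offs, the standard DPO objective fits in a few lines of PyTorch. The sketch below assumes you have already computed summed log-probabilities for each response under the trainable policy and a frozen reference model; it is a generic illustration, not code from any of the frameworks above.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss over a batch of (chosen, rejected) pairs.

    Each tensor holds the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    """
    # Log-ratio of policy vs. reference for each side of the pair.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected via a logistic loss.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```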

Best practices 

While training our multi-purpose AI evaluator, we’ve identified a set of best practices that can help address these challenges and improve evaluation performance: 

Starting with high-quality seed data

The quality of synthetic data depends on the quality of its human-generated seed data. This includes both raw data (questions, context, and responses from user-system interactions) and preference annotations:

  • For raw data, leveraging a diverse range of models, including frontier models like GPT-4, supports robust performance across distributions. Filtering techniques such as n-gram overlap analysis can minimize redundancy while keeping the data aligned with evaluation objectives (a small filtering sketch follows this list). In pairwise setups, it is also essential to avoid accidental clues to the correct answer, which would compromise evaluation integrity.
  • For preference annotations, combining AI-generated inputs with at least 60% human-labeled data helps prevent contamination and maintain alignment with real-world standards. High inter-annotator agreement (IAA) can serve as a proxy for reliability and consistency.
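
Below is a minimal sketch of the n-gram overlap filtering mentioned in the first bullet; the whitespace tokenization, trigram size, and 0.7 threshold are illustrative assumptions rather than our production settings.

```python
# Greedy n-gram overlap filter for deduplicating seed data (illustrative).

def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap between the n-gram sets of two texts."""
    na, nb = ngrams(a, n), ngrams(b, n)
    if not na or not nb:
        return 0.0
    return len(na & nb) / len(na | nb)

def filter_redundant(samples: list[str], threshold: float = 0.7) -> list[str]:
    """Keep a sample only if it is not too similar to anything kept so far."""
    kept: list[str] = []
    for s in samples:
        if all(overlap(s, k) < threshold for k in kept):
            kept.append(s)
    return kept
```
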
Improving diversity with embedding-based clustering

Embedding-based clustering has proven valuable for improving diversity and ensuring evaluators generalize effectively across varied use cases. By incorporating examples across lexical, syntactic, and semantic dimensions, we can create datasets that are both comprehensive and representative. Visualization tools like Nomic help us identify underrepresented areas in datasets and guide targeted augmentation.
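
Below is a minimal sketch of this kind of clustering, using sentence-transformers and scikit-learn as stand-ins for whatever embedding and clustering stack you prefer; the model name, cluster count, and size cutoff are illustrative assumptions, not our actual configuration.

```python
# Sketch: embed examples, cluster them, and flag small clusters as candidate
# underrepresented regions worth targeted augmentation.
from collections import Counter

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def find_sparse_clusters(texts: list[str], n_clusters: int = 20, min_size: int = 25):
    model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model
    embeddings = model.encode(texts)                  # one vector per example
    labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(embeddings)
    sizes = Counter(labels)
    # Clusters with few members point at thin regions of the data distribution.
    sparse = [cluster for cluster, size in sizes.items() if size < min_size]
    return sparse, labels
```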

Leveraging multiple evaluators to prioritize disagreements

Involving multiple LLM judges—whether through diverse models or configurations—provides a practical way to uncover edge cases. Disagreements between judgments often highlight subtle issues or ambiguities, offering valuable opportunities for refinement. By focusing on contentious cases (where LLM judges disagree) near decision boundaries, we can improve reliability, particularly in nuanced or high-stakes scenarios.
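
One simple way to surface such cases is sketched below, under the assumption that each judge is a callable returning a preferred side ("A" or "B") for a pairwise example; the judge interface here is hypothetical.

```python
# Sketch: flag pairwise examples on which multiple judges disagree, so they can
# be prioritized for human review or targeted data augmentation. Each judge is
# a hypothetical callable taking (prompt, response_a, response_b) and returning
# "A" or "B".

def contentious_cases(examples: list[dict], judges: list) -> list[dict]:
    """Return the examples on which the judges do not all agree."""
    flagged = []
    for ex in examples:
        votes = {
            judge(ex["prompt"], ex["response_a"], ex["response_b"])
            for judge in judges
        }
        if len(votes) > 1:  # any disagreement at all flags the example
            flagged.append(ex)
    return flagged
```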

Refining iteratively with feedback loops

The success of self-taught evaluators highlights the importance of iterative refinement in developing adaptable LLM evaluators. Early-stage evaluations often reveal gaps in synthetic data, scoring rubrics, or evaluation criteria, creating opportunities to fine-tune inputs and align with real-world challenges. In iteration cycles, we prioritize:

  • Edge cases near decision boundaries: These offer valuable training signals but demand careful curation to prevent overfitting (a small selection sketch follows this list).
  • Validation through public and private benchmarks: We test our models against a wide array of benchmarks to ensure real-world performance gains and alignment with industry standards. These include prominent public benchmarks like RewardBench, MT-Bench, and HHH, alongside custom benchmarks derived from real-world applications in industries such as law, finance, and healthcare. This approach provides valuable insights across multiple evaluation dimensions.
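
One simple way to select the edge cases referenced in the first bullet is sketched below, assuming each example already carries numeric judge scores for both responses; the margin value is an illustrative assumption rather than a tuned setting.

```python
# Sketch: keep the pairs whose chosen/rejected scores are nearly tied, i.e. the
# examples closest to the decision boundary. The 0.5 margin is illustrative.

def near_boundary(examples: list[dict], margin: float = 0.5) -> list[dict]:
    """Select examples where the judge's scores for the two responses are close."""
    return [
        ex for ex in examples
        if abs(ex["score_chosen"] - ex["score_rejected"]) <= margin
    ]
```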

Key takeaways

The use of synthetic data offers scalability but also brings its own set of challenges, and a systematic approach can help address them effectively. Based on our experience training what we call our general-purpose evaluator, best practices include:

  • Starting with diverse, high-quality, human-generated seed data: Use frontier models and human annotations to create robust, representative datasets while preventing contamination.
  • Using embedding-based clustering: Optimize diversity across domains, tasks, and evaluation methods to improve generalization and reduce overfitting.
  • Leveraging multi-judge setups: Identify disagreements to refine evaluations near decision boundaries.
  • Iterating with feedback loops: Prioritize edge cases and validate refinements with public benchmarks and private benchmarks to ensure alignment with real-world demands and industry standards.

If you're interested in trying out our models, you can sign up below to get started for free.
Sign up for access