It’s ironic, but evaluating an evaluation model is hard.
When you’re building an LLM-as-a-Judge to evaluate the universe of generative AI applications, you need it to be a good judge of multiple domains and tasks – from text summarization to code evaluation – while consistently aligning with human preferences. There are plenty of benchmarks out there to measure against, each assessing different aspects of LLM performance across various tasks and at different levels of granularity. These allow us to evaluate our model’s judgement ability in a well-rounded manner. However, the tricky bit is finding a way to aggregate these results into a coherent performance metric. You can’t take a simple average. So we created our own metric, which we call the Engeler Equation – our North Star for evaluating progress.
Here’s how we’re measuring the performance of our 8B-parameter Atla-1-mini model, some early results from using relative preference optimization (RPO) as the training objective, and the role the Engeler Equation plays in our development of a state-of-the-art LLM judge.
What are we building toward?
Working every day with teams building generative AI products, we see three main evaluation use cases:
- Pairwise comparisons. Comparing two responses to determine which is better for a given prompt. This helps assess improvements in response quality after changes.
- Absolute rating. Assigning a score (e.g., 1-5) to individual responses. This is useful for keeping track of the progress of new features in development or monitoring production data in the absence of a ground-truth response to compare against.
- Binary classification. Sorting responses into one of two categories: e.g., hallucinating or truthful, safe or harmful. This method is clear-cut, interpretable, and easier to align with human annotators.
To be a truly general purpose evaluator, our model needs to be robust enough to cover all three. And state-of-the-art performance, for us, means consistently outperforming other models across all three of these evaluation tasks.
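To make these formats concrete, here is a minimal sketch of what the three kinds of judgements look like in practice. The prompt templates and the call_judge helper below are illustrative placeholders of our own, not Atla’s actual prompts or API.

```python
# Illustrative sketch only: the three evaluation formats a general-purpose
# judge needs to support. `call_judge` is a hypothetical stand-in for
# whatever judge model you query; the templates are simplified examples.

PAIRWISE_TEMPLATE = (
    "Which response better answers the prompt? Reply with 'A' or 'B'.\n"
    "Prompt: {prompt}\nResponse A: {a}\nResponse B: {b}"
)

ABSOLUTE_TEMPLATE = (
    "Score the response from 1 (very poor) to 5 (excellent). "
    "Reply with a single digit.\n"
    "Prompt: {prompt}\nResponse: {response}"
)

BINARY_TEMPLATE = (
    "Is the response fully supported by the context? Reply 'yes' or 'no'.\n"
    "Context: {context}\nPrompt: {prompt}\nResponse: {response}"
)


def call_judge(judge_prompt: str) -> str:
    """Hypothetical helper: send `judge_prompt` to the judge model and
    return its raw text reply (e.g. 'A', '4', or 'yes')."""
    raise NotImplementedError
```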
How do we measure progress?
Our evaluation strategy has two key pillars. The first consists of proprietary test sets shared by our customers - these are crucial as they directly reflect real-world requirements and help ensure we're delivering value where it matters most. The second is our suite of well-established human-annotated benchmarks designed to assess different aspects of LLM performance. We'll be focusing on these in this post as we can share more details about their methodology and results.
We track our model’s progress by comparing the predicted judgements with those given by humans, effectively measuring its ability to evaluate LLM responses across numerous tasks, use cases and requirements.
For pairwise comparison:
- RewardBench. Evaluates LLM judges on human preference alignment across chat, safety, reasoning, and handling out-of-distribution prompts.
- InstruSum. Evaluates LLMs on complex instruction-following for text summarization. We only use the human-annotated subset, which contains human evaluation choices between different text summaries.
- LFQA. Evaluates LLMs on long-form question answering across seven subject areas. Contains human evaluation choices between GPT-3.5 responses and those written by human experts.
- HHH. Evaluates the safety of LLMs on helpfulness, honesty, and harmlessness. Contains human evaluation choices based on these axes between different model responses.
- EvalBiasBench. Evaluates LLM-as-a-Judge models on bias across six categories: length, correctness, empty reference, content continuation, nested instruction, and familiar knowledge.
- Auto-J. Evaluates LLMs on their generative capabilities across eight major domains: summarization, exam questions, code, creative writing, functional writing, rewriting, general communication, and NLP tasks. Contains human evaluation choices between different model responses.
For absolute rating:
- BiGGen Bench. Evaluates LLMs on their generative capabilities across nine major domains – instruction following, grounding, reasoning, planning, refinement, multilingual, safety, theory of mind, and tool usage – spanning 77 different tasks. We use the human evaluation subset, which contains human evaluation scores (based on granular scoring criteria) on responses across 103 different models.
- FLASK. Evaluates LLMs across four primary abilities (broken into 12 fine-grained skills): logical thinking, background knowledge, problem handling, and user alignment. Contains both human and GPT-4 evaluation scores on responses across four different models, but we only consider the human ones.
- MT Bench. Evaluates LLMs on their ability to engage in multi-turn conversations, assessing contextual understanding and coherence in dialogue. Contains GPT-4 evaluation scores on responses across four different models.
For classification:
- InFoBench. Evaluates LLMs on their ability to follow complex instructions, particularly when needing to integrate external information. We use the expert split which contains human yes/no classifications on responses across five different models.
- LLM-AggreFact (pre-August 9 2024 update). Evaluates LLMs on factual accuracy in knowledge-intensive tasks. Contains human yes/no classifications on whether different claims are supported by specific documents.
By testing against all these benchmarks, we’re assessing our model’s ability to evaluate a wide range of tasks and requirements – everything from long-form summarization to mathematical question answering, and from dialogue interactions to coding tasks – and whether its judgements align with human preferences.
It ensures we’re building towards a truly general evaluator, capable of adapting to different use cases. If Atla-1-mini can perform well across these benchmarks – and others as we continue to add to them – we know we’re on the right track.
But tracking overall performance across this many benchmarks is tricky, because you can’t easily take an average. Pairwise and classification tasks are measured by percentage accuracy, while absolute scoring is measured by Pearson’s correlation - representing how closely Atla-1-mini’s evaluation scores match those given by human annotators.
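To see why a naive average is misleading, here is a small, self-contained example (with made-up numbers) of the two metric types side by side: accuracy over discrete judgements versus Pearson’s r over scores.

```python
# Made-up numbers, purely to illustrate the scale mismatch: pairwise and
# classification benchmarks report accuracy, while absolute-rating
# benchmarks report Pearson's correlation against human scores.
from scipy.stats import pearsonr

# Pairwise / classification: fraction of judgements that match human labels.
human_choices = ["A", "B", "A", "A", "B"]
judge_choices = ["A", "B", "B", "A", "B"]
accuracy = sum(h == j for h, j in zip(human_choices, judge_choices)) / len(human_choices)

# Absolute rating: correlation between the judge's 1-5 scores and human scores.
human_scores = [5, 3, 4, 2, 1]
judge_scores = [4, 3, 5, 2, 1]
r, _ = pearsonr(human_scores, judge_scores)

print(f"accuracy = {accuracy:.2f}, Pearson r = {r:.2f}")
# accuracy = 0.80, Pearson r = 0.90: similar magnitudes in this toy case,
# but they answer different questions, so averaging them directly is meaningless.
```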
We need a North Star to follow: a single metric that encapsulates our overall progress.
Enter the Engeler Equation
If you want to build a general-purpose LLM evaluator, you need a general-purpose metric to guide you. We needed something that would 1) allow us to aggregate metrics with different scales, 2) capture our performance gain over a baseline model, and 3) capture our gain relative to GPT-4o. So our CTO Roman came up with a formula, which we’ve dubbed the “Engeler Equation” after him.
This equation measures our model’s performance relative to both a baseline (the pre-trained model) and the current state of the art (GPT-4o). Importantly, it includes a ceiling function that ensures no single benchmark disproportionately influences the overall metric. While it’s not perfect, it provides a clear, balanced metric that guides our development. We aim for an Engeler number greater than 1, which would indicate that we’ve surpassed GPT-4o’s performance.
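The exact formula has a few more details, but conceptually it works like the simplified sketch below: for each benchmark, express Atla-1-mini’s gain over the baseline as a fraction of GPT-4o’s gain over that same baseline, cap any single benchmark’s contribution, and average. The cap value and the exact normalisation here are placeholders, not the precise definition.

```python
# Simplified sketch of the Engeler Equation's shape, not the exact formula:
# the cap value and normalisation below are placeholders.

def engeler_number(ours: dict, baseline: dict, gpt4o: dict, cap: float = 1.5) -> float:
    """Aggregate per-benchmark gains relative to the baseline and GPT-4o.

    Each argument maps benchmark name -> score (accuracy or Pearson's r),
    so every benchmark is compared on its own natural scale before the
    relative gains are averaged.
    """
    ratios = []
    for name in ours:
        gain_ours = ours[name] - baseline[name]
        gain_sota = gpt4o[name] - baseline[name]
        if gain_sota <= 0:  # skip degenerate cases where GPT-4o doesn't beat the baseline
            continue
        # Cap each benchmark's contribution so no single result dominates.
        ratios.append(min(gain_ours / gain_sota, cap))
    return sum(ratios) / max(len(ratios), 1)
```

In this reading, a value above 1 means that, averaged across benchmarks, our gains over the baseline exceed GPT-4o’s.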
Having this North Star is important. Because we’re building a general purpose evaluator, one that works across every different domain and use case that our customers build for, it helps to have a single, simple heuristic that shows we’re moving in the right direction.
And it seems to be working.
The early results: Atla-1-mini
Our Atla-1-mini model, fine-tuned from Llama-3.1-8B, is already showing promising results. Here’s an overview of its performance over this quarter across the three scoring formats and the Engeler Equation.
The overall upward trend is clear.
Across all 11 benchmarks, we’re seeing performance gains over the baseline, and we’re closing in on GPT-4o’s performance. This is particularly impressive given that Atla-1-mini has 8 billion parameters, compared with GPT-4o’s rumored (but undisclosed) roughly 200 billion. We’re making substantial progress with a much smaller model.
For a fairer comparison, here’s the latest performance of Atla-1-mini against GPT-4o-mini. Both models comfortably outperform Llama-3.1-8B, and while the exact parameter count of GPT-4o-mini is unknown, Atla-1-mini scores an Engeler number of 1.08 against it.
Our journey to reach this point has not been linear. Along the way, we dedicated significant effort to curating and preparing training data, which sometimes delayed model re-training, as reflected in the plateaus on the line charts. There were also moments when we prioritized the wrong datasets or strategies, leading to temporary dips in performance - a natural part of the research process.
And we truly believe that getting the training data right is worth spending the time on for what we’re trying to build.
We prioritize diversity, selecting datasets that cover a wide range of tasks and evaluation formats. We then filter each dataset for quality and augment it with our own synthetically generated data. This approach ensures that our model isn’t over-specialized; it’s task-agnostic and not overly sensitive to specific prompt strategies. It all contributes to creating a well-rounded judge – one that can help developers build GenAI applications better, faster, and more safely.
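As a rough illustration (not our actual pipeline), curating such a mix might look something like the sketch below: pool diverse sources, apply a quality filter, and top up with synthetic examples. The passes_quality_check and generate_synthetic_examples helpers are hypothetical placeholders.

```python
# Illustrative sketch of a data-mixing step, not our actual pipeline.
import random


def passes_quality_check(example: dict) -> bool:
    """Hypothetical placeholder: e.g. drop truncated responses, near-duplicates,
    or examples with low inter-annotator agreement."""
    return bool(example.get("response"))


def generate_synthetic_examples(n: int) -> list[dict]:
    """Hypothetical placeholder for synthetic data generation."""
    return [{"prompt": f"synthetic-{i}", "response": "...", "source": "synthetic"}
            for i in range(n)]


def build_training_mix(datasets: dict[str, list[dict]],
                       synthetic_fraction: float = 0.2,
                       seed: int = 0) -> list[dict]:
    """Pool diverse datasets, keep only high-quality examples, and
    augment the result with synthetically generated ones."""
    rng = random.Random(seed)
    mix = []
    for name, examples in datasets.items():
        kept = [ex for ex in examples if passes_quality_check(ex)]
        mix.extend({**ex, "source": name} for ex in kept)
    mix.extend(generate_synthetic_examples(int(len(mix) * synthetic_fraction)))
    rng.shuffle(mix)  # avoid ordering effects during training
    return mix
```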
Onwards to 70B – and beyond
We’re pretty excited about how well we’re progressing towards GPT-4o performance with an 8B model, and we’re looking forward to training a 70B model on this data mix very soon and testing that out.
With the 70B model, we aim to surpass current state-of-the-art performance and become the de facto SOTA evaluation model across the benchmarks above. But this is just the beginning. We’ve got a pipeline of ideas around new data domains and training objectives, and a whole backlog of blue-sky research ideas that we’re itching to implement to push us further. And in the Engeler Equation, we have a reliable indicator to guide us there.