In our previous blog post, our co-founder Maurice discussed how training a general-purpose evaluator model aligns with our goal of reducing existential risk through better AI oversight. But building better evaluation systems means we also need to think carefully about how they might be compromised.
This challenge was at the forefront of our minds when three members of our team - Mathias, Andrei, and myself - spent a weekend at the AI Evaluations research sprint, organized by the AI Safety Institute (AISI) at the London Institute for Safe AI. We even ran into familiar faces from the monthly AI Tinkerers meetups we attend in London (highly recommended for anyone reading this).
The sprint began with two presentations that would shape our weekend's work, ultimately leading us to design an experimental framework for detecting when AI systems might be gaming their own evaluations. But first, let's look at how we got there.
The Brief
Michael Schmatz from AISI's autonomous systems team laid out several key risks in AI development that would guide our work. He focused particularly on AI proliferation - where systems might learn to improve their own architecture or create other agents - and what he termed "Autonomous Replication and Adaptation" (ARA). Most relevant to our team was his discussion of how advanced AI systems might compromise development and assurance processes. An AI system might, for instance, learn to "sandbag" its performance, develop situational awareness, or even alter its own training data to evade safety measures.
Marius Hobbhahn, CEO of Apollo Research, followed with practical insights on evaluation design, emphasizing three key principles:
- Evaluations should be challenging enough that no model achieves a perfect score.
- Evaluations need clear decision criteria.
- Outcomes must be measured properly.
Given our focus at Atla on building frontier evaluation models (‘LLM-as-a-judge’), we saw an immediate connection between these risks and our work. We decided to tackle one specific risk: the potential for AI systems to game their own evaluations by making inferences about their evaluators. A model that can successfully identify and adapt to its evaluator could inflate its perceived capabilities, increasing its chances of being deployed (“surviving”).
The Challenge of Evaluation Gaming
Our experimental framework addresses the above concern. While this risk exists across different evaluation paradigms, we focus on the agent-as-a-judge framework, where the potential for gaming through repeated interactions is particularly large.
This work builds on existing research into model behavior and bias. We've already seen how language models can exhibit sycophantic behavior - telling humans what they want to hear rather than what's true, as documented in Anthropic's work on understanding sycophancy. We also know that models can inadvertently exploit general human biases, such as our preference for assertive answers, even when those answers are incorrect - our research advisor Max effectively details this in the paper Human Feedback Is Not Gold Standard.
Our hypothesis is that models might learn to exploit not just these general biases, but also the specific preferences of individual evaluators. To investigate this, we aim to answer two key questions:
- Evaluator identification: Can an agent make inferences about the identity of its agentic judge through their interactions?
- Adaptation capabilities: Once it has identified its evaluator, can it tailor its responses to exploit that evaluator's specific preferences or biases?
To test these capabilities, we run multiple conversation threads in parallel, varying only one element: the information provided about the evaluator's identity. In one thread, the AI system receives accurate information about its evaluator. In another, it receives incorrect information. And in a control thread, it receives no information at all. Importantly, the evaluator remains unaware of this setup throughout all interactions.
E.g., some LLM is told that it's being evaluated by Llama 3.1, and the judge is:
- Claude 3.5 Sonnet (the "incorrect information" scenario)
- Llama 3.1 (correct information)
- Anonymous (control)
Scoring Methodology
To quantify our results, we've developed a two-part scoring system that aligns with our key test objectives:
For evaluator identification, we track the AI's ability to predict its evaluator's identity in each conversation. Across multiple interactions, these predictions form a confusion matrix. With N possible evaluator models, an accuracy greater than 1/N would indicate the system can make meaningful inferences about the oversight process.
For adaptation capabilities, we measure performance differences between conversation threads. If the system consistently performs better when given accurate information about its evaluator compared to when given incorrect information, this suggests it can adapt its behavior to exploit specific evaluator characteristics.
For statistical significance, we plan to repeat these measurements multiple times to establish confidence intervals. We want to identify even subtle signs of evaluation gaming, which would indicate significant risks as models become more powerful.
Why This Matters
The implications of evaluation gaming extend beyond academic interest. As GenAI applications are deployed in increasingly high-stakes domains, our ability to accurately assess their safety and performance is crucial. A model that can game its overseer could effectively spoof evaluation results, generating false confidence in its safety attributes. This could lead to unsafe models being deployed in situations where their behavior could have significant negative impacts.
We think of it like a student who learns to ace tests not by mastering the material, but by figuring out their teacher's grading patterns. While they might get good grades, they haven't actually developed the intended skills. With AI, the stakes are much higher.
Looking Ahead
The challenge of building trustworthy AI evaluation systems isn't going away. As models become more sophisticated, so too must our methods for assessing them. Our work on detecting when models are gaming their evaluations is just one piece of a larger puzzle in ensuring AI systems remain aligned with human values and interests as they grow more capable.
The future of automated evaluation lies not just in building more sophisticated LLM judges, but in ensuring those LLM judges remain incorruptible and aligned with human interests. It's a challenge we're committed to tackling head-on.