Aligning AI with AI-Assisted Human Feedback
The hard problem of alignment is getting AI to do what we intend on tasks that humans find too complex to judge on their own. As systems grow more capable, we need ways to extend our reach. One approach we’re excited about is creating an AI evaluator—an “LLM-as-a-Judge”—that can assess the quality of other AI outputs. Training such a model to say “this answer is good” or “that answer is flawed” might sound small. But as Jan Leike, a leading voice in AI alignment, has discussed, it’s an important piece of the bigger puzzle, especially as tasks get more complex.
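Concretely, the judging step can be as simple as prompting a model for a verdict on another model’s answer. Below is a minimal sketch of that idea; `complete` is a placeholder for whatever inference client you’d plug in, not a real API.

```python
from typing import Callable

JUDGE_PROMPT = """You are an impartial evaluator.

Task given to the model:
{task}

Model's answer:
{answer}

Is the answer correct, complete, and well reasoned?
Reply with exactly one word, GOOD or FLAWED, followed by a one-sentence justification."""


def judge_answer(task: str, answer: str, complete: Callable[[str], str]) -> bool:
    """Ask a judge model whether an answer is acceptable.

    `complete` is any function that sends a prompt to an LLM and returns its
    text response (a hypothetical stand-in for a real inference client).
    """
    verdict = complete(JUDGE_PROMPT.format(task=task, answer=answer))
    return verdict.strip().upper().startswith("GOOD")
```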
Scaling Beyond Simple Feedback
Reinforcement learning from human feedback (RLHF) is a popular alignment method. It works well for tasks humans can easily evaluate—for instance, “Did the model follow the instruction to summarize the news article?” But as Leike points out, RLHF alone doesn’t scale well to truly hard problems. Many tasks that future AI systems will face are too big or complicated for a human to judge at a glance. For example, imagine evaluating the accuracy and coherence of a complex scientific analysis. Humans might not have the time, skill, or confidence to say whether the reasoning is correct.
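For concreteness about what that feedback looks like mechanically: in standard RLHF setups, human preferences between pairs of answers are distilled into a reward model with a pairwise loss like the one below. This is a generic PyTorch sketch of the textbook objective, not a description of any particular training pipeline.

```python
import torch
import torch.nn.functional as F


def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry) loss commonly used to train RLHF reward models.

    `r_chosen` and `r_rejected` are the scalar scores the reward model assigns
    to the answer the human preferred and the answer they rejected. Minimizing
    the loss pushes the model to score preferred answers higher.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The catch, as the rest of this section discusses, is that this loss is only as good as the human’s ability to tell which answer is actually better.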
As tasks grow in complexity, human evaluation becomes a bottleneck. As Leike has argued, the solution is to use AI itself to assist with evaluation. Instead of humans trying to handle every nuance, we train evaluators—like our own LLM-as-a-Judge—to break down big, challenging tasks into manageable pieces. This approach, known as recursive reward modeling (RRM), aims to let humans focus on what really matters: setting the goals and preferences, rather than getting bogged down in the detailed cognitive labor of evaluation.
How Recursive Reward Modeling Fits In
RRM is the idea of using AI systems to assist humans in the evaluation process itself. When faced with a task too big for direct human assessment, the system breaks it down into simpler subtasks. Each subtask is easier to judge, and our AI evaluator can step in to say whether those smaller pieces are done correctly. By applying this process recursively, we build up a trustworthy evaluation of even very complex tasks.
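Here’s a toy sketch of that recursion, assuming the decomposition into subtasks has already been produced (by the model or by a human) and `judge` is any evaluator along the lines sketched earlier.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Task:
    description: str
    answer: str
    subtasks: List["Task"] = field(default_factory=list)


def evaluate(task: Task, judge: Callable[[Task], bool]) -> bool:
    """Recursively evaluate a task by checking its parts.

    Leaf tasks are small enough for the judge to assess directly. A larger
    task is accepted only if every one of its sub-pieces passes and the judge
    also accepts the way the pieces are combined at the top level.
    """
    if not task.subtasks:  # small enough to judge directly
        return judge(task)
    return all(evaluate(sub, judge) for sub in task.subtasks) and judge(task)
```

Real systems are messier than a boolean pass/fail, but the shape is the same: an evaluation of the whole is assembled from evaluations of pieces that are individually simple enough to check.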
Our LLM-as-a-Judge is designed to serve as that general-purpose evaluator. Instead of training one specialized model for each domain (like code review, medical advice, or research summaries), we’re working on a single evaluator that can handle them all. This broad capability is crucial: as future AI systems tackle everything from legal arguments to advanced engineering designs, we’ll need a single point of reference to check the quality of their reasoning and outputs.
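One plausible way to get that breadth out of a single model is to steer the same judge with a per-domain rubric at evaluation time rather than training a separate evaluator for each field. The sketch below is illustrative only; the rubrics and the `complete` callable are placeholders, not a description of our actual system.

```python
from typing import Callable

# Illustrative rubrics; the point is that one evaluator is specialized by a
# domain-specific rubric at inference time rather than retrained per domain.
RUBRICS = {
    "code_review": "Check correctness, security, and readability of the change.",
    "medical_advice": "Check factual accuracy, safety caveats, and appropriate scope.",
    "research_summary": "Check faithfulness to the source and coverage of key claims.",
}


def judge_with_rubric(domain: str, task: str, answer: str,
                      complete: Callable[[str], str]) -> str:
    """One general-purpose judge, pointed at a domain via its rubric."""
    prompt = (
        "You are an expert evaluator.\n"
        f"Rubric: {RUBRICS[domain]}\n\n"
        f"Task:\n{task}\n\nAnswer:\n{answer}\n\n"
        "Give a verdict (GOOD or FLAWED) and a brief justification."
    )
    return complete(prompt)
```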
Reducing Existential Risk Through Better Oversight
So how does building an evaluator help reduce existential risks from AGI? The core idea is that oversight improves safety. If we have a tool that can reliably check an advanced AI system’s reasoning, identify subtle errors, and flag dangerous or unintended consequences, we’re far less likely to let harmful behavior slip through.
By layering this evaluation approach—breaking down complex tasks, having specialized evaluators check each part, and using our LLM-as-a-Judge to guide the process—we improve our odds of catching mistakes before they scale up. This makes it harder for AI to wander off in unexpected, potentially catastrophic directions, because there’s always a watchful eye examining its steps.
A Path to Scalable Alignment
Aligning AI with human values is not solved by one trick. It’s an ongoing effort, and our evaluator is just one component. But it’s a critical one. By automating some of the cognitive labor required for evaluation and anchoring that process in human-defined preferences, we create a scalable oversight mechanism. It helps ensure that as AI grows more capable, we can still understand and shape its actions.
This is what we’re building. The idea isn’t just to have another model scoring outputs. It’s to open a path where each complicated task can be evaluated at a depth and scale previously impossible for a single human reviewer. By doing so, we aim to make advanced AI systems safer, more trustworthy, and ultimately more aligned with the values that matter most to us.