DeepSeek’s R1 model has made waves in the community, both for its impressive technical capabilities and the Chinese AI company’s embrace of open-source principles. Amid the hype and counter-hype generated by R1 lies an important detail that is easy to overlook: DeepSeek’s use of LLM judge (LLMJ) models in training a general-purpose reasoning model.
DeepSeek’s team has openly documented their approach, offering clear insight into how they achieved state-of-the-art AI reasoning capabilities—a transparency that distinguishes them from OpenAI, whose methods for training their o-series models remain undisclosed. A key factor in DeepSeek’s success was incorporating an LLM judge in the final training stages to refine and expand their model’s reasoning abilities beyond coding and mathematical problems.
From reward to reason with an LLM-as-a-Judge
LLM judges are becoming an essential part of training state-of-the-art AI models. They address a fundamental limitation of reinforcement learning from human feedback (RLHF): the limited capacity of human evaluators to assess model outputs at scale.
Replacing human feedback with AI feedback (RLAIF) started with using LLMJs to simulate human preferences over LLM responses along specific evaluation dimensions like helpfulness and harmlessness. For complex tasks like mathematical reasoning, on the other hand, rule-based, outcome-based, or process-based reward models have been used; all of these score reasoning paths against known correct answers, albeit in subtly different ways. But disciplines outside coding and mathematics have no obvious notion of “correctness”, and therefore no clear reward signal to give a model during training. This is where LLM judges come in. Rather than simply assigning a score, the judge provides a reasoned verdict explaining why one response is preferable to another. This is valuable when dealing with fuzzier, more creative domains, like language, open-ended scientific questions, and logical reasoning – exactly the sort of domains that state-of-the-art AI models are trying to crack.
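To make the contrast with scalar reward models concrete, here is a minimal sketch of a pairwise LLM judge. The prompt template and the `call_llm` helper are placeholders for whatever judge model and rubric you use, not any lab’s actual setup:

```python
# Minimal sketch of a pairwise LLM judge. `call_llm` is a hypothetical helper
# that sends a prompt to whatever judge model you have access to and returns
# its text response; swap in your own client.

JUDGE_PROMPT = """You are an impartial judge. Compare the two responses to the
user's question. First explain your reasoning step by step, then end with a
verdict line of exactly "Verdict: A" or "Verdict: B".

Question: {question}

Response A: {response_a}

Response B: {response_b}
"""

def judge_pair(question: str, response_a: str, response_b: str, call_llm) -> dict:
    """Ask the judge model for a reasoned preference between two responses."""
    critique = call_llm(JUDGE_PROMPT.format(
        question=question, response_a=response_a, response_b=response_b))
    # The critique itself is the valuable artifact: a reasoned explanation,
    # not just a scalar score.
    verdict = "A" if critique.strip().endswith("Verdict: A") else "B"
    return {"critique": critique, "preferred": verdict}
```

The point of the reasoned verdict is that it gives a training signal even in domains where there is no single correct answer to check against.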
In fact, LLMJs have been shown to be highly effective at detecting errors beyond their primary training domain, making them good reward models across various contexts and tasks. As OpenAI’s LLM Critics Help Catch LLM Bugs paper notes, fine-tuned LLM judges successfully identified hundreds of errors in ChatGPT training data rated as “flawless” by human judges, even though the majority of that data covers non-code tasks and is thus out-of-distribution for the judge model.
DeepSeek took this concept further. Traditionally, LLMJs have been used to simulate human preferences (e.g., ‘Response A is more helpful than Response B’), but DeepSeek used them to critique and refine R1’s reasoning – broadening the model’s capabilities far beyond math and code.
How DeepSeek used LLM judges to train R1
DeepSeek R1’s training process consisted of multiple interspersed stages of reinforcement learning (RL) and supervised fine-tuning (SFT). In the initial stages, training relied on reinforcement learning using rule-based rewards, which allowed the model to excel at mathematical reasoning and coding tasks – domains where correctness can be objectively measured.
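DeepSeek does not publish this reward logic as code, but the idea can be sketched as follows; the regexes, tags, and weights below are illustrative assumptions, not the actual implementation:

```python
import re

# A minimal sketch of the kind of rule-based reward used in the early RL
# stages. The exact checks DeepSeek used are not published in code form, so
# the patterns and weights here are illustrative assumptions.

def rule_based_reward(completion: str, ground_truth: str) -> float:
    reward = 0.0

    # Format reward: the model is expected to put its reasoning inside
    # <think>...</think> tags before giving a final answer.
    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
        reward += 0.5

    # Accuracy reward: extract the final boxed answer and compare it with the
    # known correct answer. This only works for verifiable domains like math
    # and coding, which is exactly the limitation discussed below.
    match = re.search(r"\\boxed\{(.+?)\}", completion)
    if match and match.group(1).strip() == ground_truth.strip():
        reward += 1.0

    return reward
```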
This rule-based stage was impressive in itself, and shows how far you can get with RL alone. DeepSeek’s R1-Zero model was trained solely with large-scale RL, without SFT as a preliminary step, and it demonstrated remarkable reasoning capabilities. But rule-based evaluation alone is insufficient for training general reasoning abilities.
So, in subsequent SFT and RL stages, DeepSeek introduced additional data, scored using a generative reward model – an LLM-as-a-judge – to expand the model’s reasoning capabilities to other domains. Following the first stage of reasoning-oriented RL, they used the generative reward model to sample high-quality reasoning traces for SFT:
“We curate reasoning prompts and generate reasoning trajectories by performing rejection sampling from the checkpoint from the above RL training. In the previous stage, we only included data that could be evaluated using rule-based rewards. However, in this stage, we expand the dataset by incorporating additional data, some of which use a generative reward model by feeding the ground-truth and model predictions into DeepSeek-V3 for judgment.”
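In practice, this amounts to something like the following rejection-sampling loop; `generate`, `judge_score`, the sample count, and the acceptance threshold are all hypothetical stand-ins for DeepSeek’s actual pipeline:

```python
# Hypothetical sketch of rejection sampling with a generative reward model,
# as described in the quote above. `generate`, `judge_score` and the
# threshold are stand-ins for whatever model calls and cutoffs you use.

def collect_sft_traces(prompts, generate, judge_score,
                       samples_per_prompt: int = 8, threshold: float = 0.8):
    """Keep only reasoning trajectories that the judge model rates highly."""
    curated = []
    for prompt in prompts:
        # Sample several candidate reasoning trajectories from the RL checkpoint.
        candidates = [generate(prompt) for _ in range(samples_per_prompt)]
        # Ask the judge (DeepSeek-V3 in the paper) to score each candidate,
        # optionally alongside a ground-truth answer when one exists.
        scored = [(judge_score(prompt, c), c) for c in candidates]
        best_score, best_candidate = max(scored, key=lambda pair: pair[0])
        if best_score >= threshold:
            curated.append({"prompt": prompt, "completion": best_candidate})
    return curated
```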
The DeepSeek-V3 technical report further details how this generative reward model provides feedback even when no clear ground truth exists.
“For questions with free-form ground-truth answers, we rely on the reward model to determine whether the response matches the expected ground-truth. Conversely, for questions without a definitive ground-truth, such as those involving creative writing, the reward model is tasked with providing feedback based on the question and the corresponding answer as inputs. The reward model is trained from the DeepSeek-V3 SFT checkpoints. To enhance its reliability, we construct preference data that not only provides the final reward but also includes the chain-of-thought leading to the reward. This approach helps mitigate the risk of reward hacking in specific tasks.”
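A rough sketch of the two judging modes described above might look like this; the prompt templates and score parsing are assumptions for illustration, not the templates DeepSeek used:

```python
# Sketch of the two judging modes the DeepSeek-V3 report describes: matching
# against a free-form ground truth, and judging open-ended answers with no
# ground truth at all. Prompts and parsing are illustrative assumptions.

MATCH_PROMPT = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model answer: {answer}\n"
    "Think step by step about whether the model answer matches the reference, "
    "then give a final score between 0 and 1 on the last line as 'Score: <x>'."
)

OPEN_ENDED_PROMPT = (
    "Question: {question}\n"
    "Model answer: {answer}\n"
    "Think step by step about the quality of this answer, then give a final "
    "score between 0 and 1 on the last line as 'Score: <x>'."
)

def generative_reward(question, answer, call_judge, reference=None) -> float:
    """Return a scalar reward backed by a chain-of-thought judgment."""
    if reference is not None:
        prompt = MATCH_PROMPT.format(question=question, reference=reference, answer=answer)
    else:
        prompt = OPEN_ENDED_PROMPT.format(question=question, answer=answer)
    judgment = call_judge(prompt)
    # The chain of thought leading to the score is kept alongside the reward,
    # which is what makes reward hacking harder to get away with.
    last_line = judgment.strip().splitlines()[-1]
    try:
        return float(last_line.split("Score:")[-1].strip())
    except ValueError:
        return 0.0
```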
This process enabled DeepSeek-R1 to generalize beyond mathematical reasoning and apply logical inference in more ambiguous, real-world scenarios. And the use of chain-of-thought critiques helped ensure reliability and prevented the model from exploiting weaknesses in the reward system to artificially boost its scores.

To further align R1 with human preferences, DeepSeek then implemented a secondary stage of RL, using an LLMJ “to capture human preferences in complex and nuanced scenarios”, similar to previous RLAIF approaches. This step played a crucial role in aligning DeepSeek-R1’s capabilities, allowing it to excel on open-ended reasoning tasks while remaining helpful and harmless.
DeepSeek also used its own model as a judge to generate high-quality training data for smaller models, turning these smaller non-reasoning models into reasoning ones by fine-tuning them on outputs from a reasoning model. In other words, DeepSeek-R1 effectively acts as an AI judge by generating training data for smaller models, reinforcing the promise of using AI models as evaluators. That said, DeepSeek-R1 was not specifically trained as an LLM judge; like most AI labs, DeepSeek used its best available model as the judge. While this approach works, we know from our own research that models trained to be dedicated judges through rigorous post-training can outperform off-the-shelf models several times their size on evaluation tasks. Our Selene Mini model is a state-of-the-art, general-purpose evaluation model that was trained on dedicated datasets, and it outperforms GPT-4o on RewardBench, the gold standard benchmark for reward models.
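Conceptually, this distillation step reduces to generating teacher traces and fine-tuning a student on them; the sketch below uses placeholder function names rather than DeepSeek’s actual tooling:

```python
# Hypothetical sketch of the distillation step: generate reasoning traces with
# a strong teacher model and fine-tune a smaller student on them via standard
# SFT. `teacher_generate` is a placeholder, not DeepSeek's actual pipeline.

def build_distillation_dataset(prompts, teacher_generate):
    """Collect (prompt, reasoning trace) pairs from the teacher model."""
    return [{"prompt": p, "completion": teacher_generate(p)} for p in prompts]

# The resulting pairs are then used for ordinary supervised fine-tuning of the
# smaller student model, which is how a non-reasoning model picks up the
# teacher's reasoning behaviour.
```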
What this means for the future of training models
The use of LLM judges isn't new, but their application in training state-of-the-art reasoning models—along with the open publication of this methodology—marks a significant shift.
DeepSeek’s approach offers a compelling case for anyone training sophisticated AI models to use LLM judges in their own training pipelines, whether for evaluating model-generated responses, curating training datasets, or shaping model behavior through reinforcement learning stages. As more AI labs adopt LLMJs to evaluate reasoning traces, we can expect models to become more transparent, adaptable, and capable of handling complex, real-world reasoning.