We are excited to introduce Atla Selene 1, a state-of-the-art LLM Judge trained specifically to evaluate generative AI responses. Selene 1 is the best model on the market for evals, beating frontier models from leading labs—including OpenAI's o-series, Anthropic's Claude 3.5 Sonnet, and DeepSeek's R1—across 11 commonly used benchmarks for evaluators.

Selene is a general-purpose evaluator, excelling at a wide range of evaluation tasks including absolute scoring (e.g. on a scale from 1-5), classification (e.g. answer with yes/no), and pairwise preference (e.g. which is better, A/B). It backs its scores up with actionable chain-of-thought critiques, and can be used in diverse applications with or without reference responses: detecting hallucinations in RAG systems, assessing logical reasoning in agents, or verifying correctness in specific domains.
Selene responds well to fine-grained steering, allowing users to customize evaluation criteria precisely for their needs. To leverage Selene’s instruction-following capabilities, we’re also launching our Alignment Platform—a tool that helps users automatically generate, test, and refine custom evaluation metrics with just a description of their task, little-to-no prompt engineering required.
Selene 1 is available now through our API and SDK, with comprehensive documentation and default metrics to help users get started. Our Alignment Platform is also available to all users, with features to help build evaluation metrics for custom use cases. Start for free.

A frontier model for evaluation
Selene 1 outperforms frontier models from other leading labs (Figure 1) across 11 benchmarks commonly used for evaluators. These include OpenAI’s o-series reasoning models (the latest o3-mini, as well as o1 and o1-mini) and GPT-4o; Anthropic’s Claude 3.5 Sonnet; Meta’s newest Llama 3.3; and DeepSeek’s most performant reasoning model, R1.¹
Selene’s performance reflects its strength as a general-purpose evaluator that excels at a variety of evaluation tasks. To achieve this, we trained Selene using a similar methodology to its smaller open-source counterpart, Selene Mini, on a selection of carefully curated evaluation datasets (read more in the technical report). Selene is capable of accurately judging:
- Fine-grained absolute scoring tasks, e.g. "Evaluate the logical coherence of this response on a scale of 1-5."
- Classification tasks, e.g. "Does this response address the user query? Answer Yes or No."
- Pairwise preference tasks, e.g. "Which of the following models responded more empathetically to the user: A or B?"
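To make these formats concrete, here is a minimal sketch of what a judge prompt for each task type might look like. The wording and placeholder names are our own illustration, not Selene's required input schema; see the docs for the structured format the API expects.

```python
# Illustrative judge-prompt templates for the three task types.
# Wording and field names are our own sketch, not Selene's input schema.

ABSOLUTE_PROMPT = """Evaluate the logical coherence of the response on a scale of 1-5.
User query: {query}
Response: {response}
Rubric: 1 = incoherent ... 5 = fully coherent.
First write a brief critique, then give an integer score."""

CLASSIFICATION_PROMPT = """Does the response address the user query? Answer Yes or No.
User query: {query}
Response: {response}"""

PAIRWISE_PROMPT = """Which response is more empathetic to the user: A or B?
User query: {query}
Response A: {response_a}
Response B: {response_b}"""
```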

Selene excels at capturing human preferences on nuanced and complex evaluations, as demonstrated by its state-of-the-art performance on FLASK (Fine-grained Language evaluation based on Alignment SKill sets), a challenging benchmark that measures alignment with human judgments on fine-grained evaluation rubrics; Selene achieves a ~0.71 Pearson correlation with human scores. It also achieves state-of-the-art performance on absolute scoring tasks overall, and on MT-Bench, which evaluates increasingly complex multi-turn conversations. Selene further outperforms frontier models on Auto-J and RewardBench, two popular benchmarks that capture alignment with human preferences between pairs of LLM responses across chat, reasoning, safety, and other diverse real-world domains.
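For reference, alignment on FLASK-style benchmarks is typically reported as the Pearson correlation between judge scores and human scores. A minimal sketch of that computation using scipy, with placeholder scores rather than real benchmark data:

```python
from scipy.stats import pearsonr

# Placeholder scores for illustration only, not actual FLASK data.
human_scores = [5, 3, 4, 2, 5, 1, 4, 3]
judge_scores = [5, 3, 5, 2, 4, 1, 4, 2]

r, p = pearsonr(human_scores, judge_scores)
print(f"Pearson r = {r:.2f}")  # 1.0 would mean perfect agreement with humans
```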
¹ Reasoning models were given a maximum of 2048 completion tokens for the task, while all other models, including ours, were given 1024.
Steering Selene for custom eval use cases
Different use cases demand subtly different approaches to evaluating AI responses. We trained Selene to be easily customizable: it excels at following evaluation criteria and score rubrics closely, and responds well to fine-grained steering.
For instance, developers using LLM Judges frequently encounter the problem of evals becoming saturated, i.e. model responses receiving high scores too frequently, making the eval less useful. In such situations, one might want to “make it harsher” so that fewer responses receive high scores. Alternatively, one might want to “flip the scores” so that the eval gives high scores to failures rather than successes. We tested Selene’s steerability by tweaking the score rubric on the FLASK dataset to be “harsher” or “flipped”, and found that Selene’s grading curve is successfully steered in both cases (Figure 2; left: the curve moves downwards, right: the curve flips horizontally) while retaining its scoring sensitivity (the slope of the curve).
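As an illustration of what those steering edits can look like, here is a sketch of a base rubric alongside “harsher” and “flipped” variants. These are our own example rubrics, not the exact ones used in the FLASK experiment:

```python
# Our own example rubrics; not the exact rubrics from the FLASK experiment.

BASE_RUBRIC = """5: The response is logically coherent throughout.
3: The response has minor logical gaps.
1: The response is incoherent."""

# "Harsher": top scores are reserved for near-flawless responses,
# so fewer responses saturate the high end of the scale.
HARSH_RUBRIC = """5: Flawless reasoning with no gaps whatsoever.
3: Mostly sound reasoning with one or two small gaps.
1: Contains any noticeable logical flaw."""

# "Flipped": high scores now flag failures rather than successes.
FLIPPED_RUBRIC = """5: The response is incoherent.
3: The response has minor logical gaps.
1: The response is logically coherent throughout."""
```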

Alignment Platform
To help users leverage Selene’s customizability, we are releasing our Alignment Platform to all users. The Platform lets you automatically generate evaluation prompts from a task description, test them against test cases, and refine them to fit your custom eval needs.
There are several features to help you easily align your evaluation prompt to your specific needs. One lets you adjust a prompt simply by describing the edit you want, e.g. ‘Make the scoring harsher.’ Another lets you add few-shot examples for the prompt to incorporate. As a steerable model, Selene adapts effectively to these changes.
Let’s take an example: you’ve built an AI mental health support chatbot. You want to evaluate whether your chatbot provides support to users while avoiding giving medical advice. Watch our demo.
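As a sketch of what the resulting metric might look like, here is a hypothetical evaluation prompt plus one few-shot example of the kind you could supply. The wording is ours; on the Platform you would generate and refine something like this from your task description:

```python
# Hypothetical metric for the mental health support example; the
# Alignment Platform would generate and refine this from a task description.
SUPPORT_NOT_ADVICE_PROMPT = """You are evaluating a mental health support chatbot.
Answer Yes if the response is emotionally supportive AND avoids medical
advice (diagnoses, medication, or dosage recommendations).
Answer No otherwise, and explain which criterion failed.

User message: {user_message}
Chatbot response: {chatbot_response}"""

# A few-shot example the judge can calibrate against (illustrative format).
few_shot_example = {
    "user_message": "I feel anxious all the time and can't sleep.",
    "chatbot_response": "Try taking melatonin every night before bed.",
    "expected_answer": "No",
    "critique": "Supportive in tone, but recommends a specific remedy, "
                "which counts as medical advice.",
}
```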
To learn more about the Alignment Platform, check out our guide.
Get started
Selene 1 is designed for straightforward integration into existing workflows and works smoothly with popular frameworks like DeepEval, Langfuse, and more—just drop it into your pipeline.
- API: Selene is available via API, with a structured input-output format; a minimal request sketch follows this list.
- Alignment Platform: Measure what matters for your AI application. Use the Alignment Platform to create and refine custom evaluation metrics.
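To give a feel for the integration, here is a minimal request sketch. The endpoint URL and field names below are placeholders of our own, not the real schema; consult the docs for the actual request format and the SDK for a typed client:

```python
import os
import requests

# Placeholder endpoint and field names for illustration only;
# see the Atla docs for the actual request schema.
response = requests.post(
    "https://api.atla-ai.com/v1/eval",  # placeholder URL
    headers={"Authorization": f"Bearer {os.environ['ATLA_API_KEY']}"},
    json={
        "model": "atla-selene",
        "evaluation_criteria": "Does the response address the user query? "
                               "Answer Yes or No.",
        "input": "What is the capital of France?",
        "output": "The capital of France is Paris.",
    },
    timeout=30,
)
response.raise_for_status()
print(response.json())  # expect a critique and a score in the structured output
```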
Read our docs. We’re excited to hear from the community. To discuss, join our Discord!