Build custom eval metrics with the Eval Copilot (formerly Alignment Platform)

Atla team

March 5, 2025

Different use cases demand different approaches to evaluating AI responses. This often means that default eval metrics won’t be precise enough for a user’s evaluation needs.

Users need high quality custom metrics that evaluate what matters to them–be that “detect responses that veer into medical advice,” “flag statements that contradict company policy,” “assess the correctness of agent workflows," or the like.

To streamline the process of creating custom eval prompts, we built the Eval Copilot–a tool that allows users to 1) automatically generate evaluation prompts with a simple task description, 2) test the eval prompts on test cases, and 3) improve them to fit their custom eval needs.

This process allows you to leverage our evaluation model Selene 1 to produce the best evals for your needs. Watch our demo video of the Eval Copilot at the end of this blog post.

NOTE: Eval Copilot is currently in open beta.

‍Generate eval prompts

Creating a prompt in the Eval Copilot is as easy as describing your eval task in a sentence or adapting one of our templates. Choose a metric scoring type between binary 0/1 and a scale of 1-5. Our prompt generator will then generate a high-quality eval prompt for you.

‍Let’s take an example: you’ve created an AI therapist. You want to evaluate whether your chatbot provides support to users, while avoiding giving medical advice.

‍Test the prompt over test cases

Explore how Selene responds to your eval prompt with some test cases. You can upload your own CSV with test cases, manually enter test cases, or use our feature to generate realistic test cases. If you are generating test cases, you can customize them by describing the type of test cases you would like to generate.

‘Run evaluations’ over the test cases to generate eval scores. Compare Selene’s scores with your expected scores, and understand how aligned Selene is with your expectations using the ‘Alignment score.’

‍In our example, the first prompt generated does quite well on our own test data, with an overall ‘Alignment score’ of 80%. It doesn’t do so well on edge-cases where it’s more ambiguous as to whether the chatbot gives medical advice.

‍Improve eval prompts

There are several features to help users easily improve prompts and align them to their specific needs. One of these features is adjusting prompts by simply describing how you would like to edit the prompt: e.g. ‘Make the scoring harsher.’ Another feature is adding few-shot examples for the prompt to adjust to. As a steerable model, Selene will adapt effectively to these changes.

See how these features work in the demo video.

When you have finalized your prompt, you can deploy it as an evaluation metric to be used with our API. To learn more about our API, check out our docs.