Over the past few weeks, we've heard from dozens of teams that they think about AI performance along multiple dimensions. Whether for customer support, content generation, or agent workflows, teams want a comprehensive view of how their AI performs. A chatbot’s response doesn’t just need to be accurate; it likely also needs to be helpful, concise, and logically sound.

So what’s the best way to evaluate across multiple metrics? Some teams start with complex, multi-layered evaluations, trying to measure several dimensions in a single step. But across industries, we consistently see that the best results come from evaluating each metric independently before synthesizing insights across criteria.
Best practices
Instead of bundling multiple factors together, teams that first assess each metric (whether factuality, conciseness, or logical soundness) in isolation get clearer insights and more reliable scores. This method makes it easier to diagnose weaknesses, improve model outputs, and build a well-rounded evaluation process.
To apply these best practices with Selene, the key is to evaluate each criterion as an individual metric. Selene is trained for fine-grained analysis, performing best when it scores one dimension at a time.
We built a cookbook to streamline this process. It allows you to run multiple evaluation metrics over a dataset, with Selene generating scores and critiques for all selected metrics per response before moving to the next. This ensures structured, scalable evaluations without unnecessary complexity.
Cookbook walkthrough
Setup:
Install Atla’s Python SDK and initialize the Atla client with your API key. You can get an API key for free by signing up here and going to your Dashboard.
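As a rough sketch (the package name, client class, and placeholder API key below follow Atla’s Python SDK conventions and are assumptions; the cookbook has the canonical setup):

```python
# Install the SDK (assumed package name; see the cookbook for the exact command):
#   pip install atla

from atla import Atla

# Initialize the client with the API key from your Atla Dashboard.
client = Atla(api_key="YOUR_ATLA_API_KEY")
```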
Run a simple multi-criteria eval:
To set up multi-criteria evals, isolate each criterion and call the Selene API on each one individually.
We provide a simple example to illustrate this by evaluating a single input/output pair against two separate evaluation criteria. Try evaluating the AI’s response to a question on capital cities by checking for 1) factual errors and 2) spelling errors.
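The pattern looks roughly like this: one Selene call per criterion. The example question, the criteria wording, the model ID, and the response fields below are illustrative assumptions; refer to the cookbook for the exact calls.

```python
question = "What is the capital of Australia?"
answer = "The capital of Australia is Sydney."  # assumed example: contains a factual error

# Assumed criteria wording; each criterion is scored in isolation.
criteria = {
    "factuality": "Does the response contain any factual errors? Score 1 if it is fully accurate, 0 otherwise.",
    "spelling": "Does the response contain any spelling errors? Score 1 if there are none, 0 otherwise.",
}

# One Selene call per criterion.
for name, criterion in criteria.items():
    response = client.evaluation.create(
        model_id="atla-selene",          # assumed model identifier
        model_input=question,
        model_output=answer,
        evaluation_criteria=criterion,
    )
    evaluation = response.result.evaluation  # assumed response shape
    print(f"{name}: score={evaluation.score}, critique={evaluation.critique}")
```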

↪️ Selene catches the factual error and judges that there are no spelling errors.
To apply this to simple business use cases, we define a function to run multi-criteria evals over a pandas dataframe. You can run Selene evaluations asynchronously using Atla’s async client as shown below.
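A minimal sketch of such a function follows, assuming the DataFrame has "question" and "response" columns and reusing the same per-criterion call as above. The helper names are hypothetical and the cookbook’s implementation may differ.

```python
import asyncio

import pandas as pd
from atla import AsyncAtla

async_client = AsyncAtla(api_key="YOUR_ATLA_API_KEY")

async def evaluate_row(row: pd.Series, criteria: dict[str, str]) -> dict:
    """Score one datapoint against every criterion before moving to the next row."""
    results = {}
    for name, criterion in criteria.items():
        response = await async_client.evaluation.create(
            model_id="atla-selene",          # assumed model identifier
            model_input=row["question"],     # assumed column name
            model_output=row["response"],    # assumed column name
            evaluation_criteria=criterion,
        )
        evaluation = response.result.evaluation
        results[f"{name}_score"] = evaluation.score
        results[f"{name}_critique"] = evaluation.critique
    return results

async def run_multi_criteria_evals(df: pd.DataFrame, criteria: dict[str, str]) -> pd.DataFrame:
    # Evaluate all rows concurrently, then append the per-criterion columns.
    rows = await asyncio.gather(*(evaluate_row(row, criteria) for _, row in df.iterrows()))
    return df.join(pd.DataFrame(rows, index=df.index))

# Example usage:
# scored_df = asyncio.run(run_multi_criteria_evals(df, criteria))
```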

↪️ Specify your different criteria and Selene will output scores and critiques for each criterion on every datapoint.
Multi-criteria evals on a public dataset:
In the second section of the cookbook, we test how well Selene performs on complicated multi-criteria eval tasks using a public benchmark called FLASK. We also provide ways to compare human scores and Selene’s scores, which you can apply to your labelled datasets.
FLASK is a benchmark dataset consisting of 1,740 human-annotated samples from 120 NLP datasets across multiple domains. Evaluators assign scores from 1 to 5 for each annotated skill. The dataset includes predefined scoring rubrics for 12 criteria, which is what we use for our multi-criteria evals. These include Insightfulness, Factuality, Comprehension, Conciseness, Logical Robustness, and more.
Once you load the dataset, run evals with Selene exactly as we did earlier.
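As an illustrative sketch, assuming the FLASK samples have been exported to a local JSONL file (a hypothetical path) containing the question, response, and human-score fields:

```python
import pandas as pd

# Hypothetical local path: assumes the FLASK evaluation samples were downloaded
# as a JSONL file with question, response, and human-annotated score fields.
flask_df = pd.read_json("flask_evaluation.jsonl", lines=True)

# Reuse the 12 predefined rubrics as separate criteria, one Selene call each,
# e.g. {"factuality": "...", "conciseness": "...", "logical_robustness": "..."}.
# scored_flask_df = asyncio.run(run_multi_criteria_evals(flask_df, flask_criteria))
```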
↪️ Finally, you can visualize Selene’s scores with respect to the human/expert scores using the three simple analyses we provide. This includes a score distribution graph:
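A minimal sketch of such a plot, assuming the column names produced by the hypothetical helper above; the cookbook’s analyses may differ.

```python
import matplotlib.pyplot as plt

criterion = "factuality"  # any of the 12 FLASK skills

# Side-by-side histograms of human scores vs. Selene scores on the 1-5 scale.
fig, ax = plt.subplots()
ax.hist(scored_flask_df[f"human_{criterion}_score"], bins=5, alpha=0.6, label="Human")
ax.hist(scored_flask_df[f"{criterion}_score"], bins=5, alpha=0.6, label="Selene")
ax.set_xlabel("Score (1-5)")
ax.set_ylabel("Number of responses")
ax.set_title(f"Score distribution: {criterion}")
ax.legend()
plt.show()
```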

For tasks like these, we have found that teams like to align Selene more closely with their use case. We have built the Eval Copilot (beta) to do exactly that! You can check that out here.
Improve your AI
Leverage Selene’s multi-metric evaluations to:
- Compare performance across different prompts or models to find the best-performing configurations.
- Spot trade-offs—a response that’s highly insightful might sacrifice conciseness. Multi-metric evaluations help you balance these dimensions.
- Set eval benchmarks for deploying models in production, ensuring responses meet quality thresholds across all critical factors.
Try our cookbook.
For more help setting up your eval metrics, book a session with our team.