Today we’re announcing our native integration with Langfuse, deepening our joint efforts to support accurate and scalable LLM evaluations. This native integration allows developers to run Atla’s evaluation model Selene 1 as an “LLM-as-a-Judge” within Langfuse’s LLM engineering platform. We’re excited to bring Selene to an engaged community of AI developers.
Langfuse is an LLM engineering platform that helps teams debug and improve their applications by observing traces, running evals, managing prompts, and more.
Selene 1 is our latest evaluation model. It outperforms frontier models, including OpenAI's o-series and Anthropic's Claude, across 11 commonly used evaluator benchmarks. This performance reflects Selene's strength as a general-purpose evaluator that handles a wide variety of evaluation tasks.
To set up Selene in Langfuse, simply add your Atla API key in the Langfuse settings.
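Because the Atla key is stored in Langfuse itself, your application code only needs the usual Langfuse credentials. As a minimal sketch (Python, v2-style Langfuse SDK; the keys below are placeholders):

```python
import os
from langfuse import Langfuse

# Placeholder Langfuse project credentials. The Atla API key itself is
# entered once in the Langfuse settings UI, not in application code.
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"

langfuse = Langfuse()         # picks up the environment variables above
assert langfuse.auth_check()  # verifies the credentials and connection
```

With the Atla key in place, Selene becomes available as a judge model when you configure an LLM-as-a-Judge evaluator in Langfuse.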
Use cases
You can use Selene as an LLM Judge in Langfuse to monitor your app's performance in production by running evals over traces, and to run pre-production experiments over datasets. We provide demo videos and cookbooks for both use cases.
Monitor your app by running evals over traces
Get started with our RAG app example.
This cookbook builds a Gradio application with a RAG pipeline. The app is a simple chatbot that answers questions based on a single webpage, set here to Google's Q4 2024 earnings call transcript. Traces are automatically sent to Langfuse and scored by Selene. The evaluation example in this cookbook assesses the retrieval component of the RAG app against a 'context relevance' criterion.
The demo video walks through the same example but evaluates the output of the RAG app by assessing ‘faithfulness.’
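To make the flow concrete, here is a sketch of the tracing pattern the cookbook relies on, using the `@observe` decorator from the Langfuse Python SDK (v2-style). The `retrieve` and `generate` stubs are hypothetical stand-ins for the cookbook's actual RAG components, not its exact code:

```python
from langfuse.decorators import observe

@observe()  # traced as a nested span under the parent trace
def retrieve(question: str) -> list[str]:
    # Hypothetical stand-in: the cookbook retrieves chunks of the
    # indexed earnings-call webpage from a vector store here.
    return ["...retrieved passage from the earnings call transcript..."]

@observe()  # nested span for the generation step
def generate(question: str, contexts: list[str]) -> str:
    # Hypothetical stand-in for the cookbook's LLM call.
    return f"Answer based on {len(contexts)} retrieved passage(s)."

@observe()  # the top-level call becomes the Langfuse trace Selene scores
def rag_answer(question: str) -> str:
    contexts = retrieve(question)
    return generate(question, contexts)

print(rag_answer("What were the Q4 2024 results?"))
```

With a Selene evaluator configured in Langfuse, each incoming trace (or its retrieval span, for context relevance) is scored automatically; no evaluation code is needed in the app itself.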
Conduct offline experiments by running evals over datasets
Get started with our experiment that compares model performance to help you choose a base model. (Other experiments you can run include comparing prompts or retrieval logic.)
This cookbook compares the performance of several models (gpt-4o, o1-mini, and o3-mini) on function calling tasks using the Salesforce ShareGPT dataset. The notebook uploads the dataset to Langfuse and sets up an experiment run for each model. Each model's outputs are then automatically evaluated by Selene.
The demo video walks through the same example.
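For orientation, here is a sketch of what the experiment loop looks like with the dataset API of the Langfuse Python SDK (again v2-style). The dataset name, model list, and `run_model` stub are illustrative assumptions, not the notebook's exact code:

```python
from langfuse import Langfuse
from langfuse.decorators import observe

langfuse = Langfuse()  # assumes LANGFUSE_* environment variables are set

DATASET_NAME = "function-calling-eval"     # hypothetical dataset name
MODELS = ["gpt-4o", "o1-mini", "o3-mini"]

@observe()  # traced, so each execution shows up as a trace in Langfuse
def run_model(model: str, prompt: str) -> str:
    # Hypothetical stand-in for the notebook's actual model call.
    return f"[{model}] proposed tool call for: {prompt}"

dataset = langfuse.get_dataset(DATASET_NAME)

for model in MODELS:
    for item in dataset.items:
        # Links this execution to the dataset item under a named run,
        # so the models can be compared side by side in Langfuse.
        with item.observe(run_name=f"compare-{model}"):
            run_model(model, item.input)

langfuse.flush()  # ensure all traces are sent before the script exits
```

Each named run appears in the Langfuse UI alongside Selene's scores, so you can compare models on the same dataset items at a glance.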
We’d love to hear from you
Follow us on X and LinkedIn for more announcements. Join our discussion on Discord.
