Today we’re announcing our native integration with Langfuse, deepening our joint efforts to support accurate and scalable LLM evaluations. This native integration allows developers to run Atla’s evaluation model Selene 1 as an “LLM-as-a-Judge” within Langfuse’s LLM engineering platform. We’re excited to bring Selene to an engaged community of AI developers.
Langfuse is an LLM engineering platform that helps teams debug and improve their applications by observing traces, running evals, managing prompts, and more.
Selene 1 is our latest evaluation model. It outperforms frontier models, including OpenAI's o-series and Anthropic's Claude, across 11 commonly used evaluator benchmarks. This performance reflects Selene's strength as a general-purpose evaluator that handles a wide variety of evaluation tasks.
To set up Selene in Langfuse, simply add your Atla API key in the Langfuse settings.
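Because the Atla key is stored in Langfuse itself, your application code only needs the usual Langfuse credentials. As a minimal sketch (Python, v2-style Langfuse SDK; the keys below are placeholders):

```python
import os
from langfuse import Langfuse

# Placeholder Langfuse project credentials. The Atla API key itself is
# entered once in the Langfuse settings UI, not in application code.
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"

langfuse = Langfuse()         # picks up the environment variables above
assert langfuse.auth_check()  # verifies the credentials and connection
```

With the Atla key in place, Selene becomes available as a judge model when you configure an LLM-as-a-Judge evaluator in Langfuse.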
Use cases
You can use Selene as an LLM Judge in Langfuse to monitor your app's performance in production by running evals over traces, and to run pre-production experiments over datasets. We provide demo videos and cookbooks for both use cases.
Monitor your app by running evals over traces
Get started with our RAG app example.
This cookbook builds a Gradio application with a RAG pipeline. The app is a simple chatbot that answers questions based on a single webpage, set here to Google's Q4 2024 earnings call transcript. Traces are automatically sent to Langfuse and scored by Selene. The evaluation example in this cookbook assesses the retrieval component of the RAG app against a 'context relevance' criterion.
The demo video walks through the same example but evaluates the output of the RAG app by assessing ‘faithfulness.’
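To make the flow concrete, here is a sketch of the tracing pattern the cookbook relies on, using the `@observe` decorator from the Langfuse Python SDK (v2-style). The `retrieve` and `generate` stubs are hypothetical stand-ins for the cookbook's actual RAG components, not its exact code:

```python
from langfuse.decorators import observe

@observe()  # traced as a nested span under the parent trace
def retrieve(question: str) -> list[str]:
    # Hypothetical stand-in: the cookbook retrieves chunks of the
    # indexed earnings-call webpage from a vector store here.
    return ["...retrieved passage from the earnings call transcript..."]

@observe()  # nested span for the generation step
def generate(question: str, contexts: list[str]) -> str:
    # Hypothetical stand-in for the cookbook's LLM call.
    return f"Answer based on {len(contexts)} retrieved passage(s)."

@observe()  # the top-level call becomes the Langfuse trace Selene scores
def rag_answer(question: str) -> str:
    contexts = retrieve(question)
    return generate(question, contexts)

print(rag_answer("What were the Q4 2024 results?"))
```

With a Selene evaluator configured in Langfuse, each incoming trace (or its retrieval span, for context relevance) is scored automatically; no evaluation code is needed in the app itself.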
Conduct offline experiments by running evals over datasets
Get started with our experiment that compares model performance to help you choose a base model. (Other experiments you can run include comparing prompts or retrieval logic.)
This cookbook compares the performance of several models (gpt-4o, o1-mini, and o3-mini) on function calling tasks using the Salesforce ShareGPT dataset. The notebook uploads the dataset to Langfuse and sets up an experiment run for each model. Each model's outputs are then automatically evaluated by Selene.
The demo video walks through the same example.
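For orientation, here is a sketch of what the experiment loop looks like with the dataset API of the Langfuse Python SDK (again v2-style). The dataset name, model list, and `run_model` stub are illustrative assumptions, not the notebook's exact code:

```python
from langfuse import Langfuse
from langfuse.decorators import observe

langfuse = Langfuse()  # assumes LANGFUSE_* environment variables are set

DATASET_NAME = "function-calling-eval"     # hypothetical dataset name
MODELS = ["gpt-4o", "o1-mini", "o3-mini"]

@observe()  # traced, so each execution shows up as a trace in Langfuse
def run_model(model: str, prompt: str) -> str:
    # Hypothetical stand-in for the notebook's actual model call.
    return f"[{model}] proposed tool call for: {prompt}"

dataset = langfuse.get_dataset(DATASET_NAME)

for model in MODELS:
    for item in dataset.items:
        # Links this execution to the dataset item under a named run,
        # so the models can be compared side by side in Langfuse.
        with item.observe(run_name=f"compare-{model}"):
            run_model(model, item.input)

langfuse.flush()  # ensure all traces are sent before the script exits
```

Each named run appears in the Langfuse UI alongside Selene's scores, so you can compare models on the same dataset items at a glance.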
We’d love to hear from you
Follow us on X and LinkedIn for more announcements. Join our discussion on Discord.
