Building a great GenAI app requires generating high-quality AI responses for a large volume of custom user inputs. This means developers need a good system for running evaluations during both development and production. Here are my learnings from looking at dozens of implementations.
Clarify Evaluation Goals: Set the key metrics that align with your application's objectives. Having clear goals will guide tool selection and evaluation design. You may also want to find a “Principal Domain Expert” whose judgment is crucial for the success of your AI product.
Choose the Right Tool for Your Team: Align the tool's capabilities with your team's expertise and workflow. For developer-centric teams, code-first tools like LangSmith or Langfuse may be preferable; if you’re collaborating with non-technical subject matter experts, you may find e.g. Braintrust serves your needs better.
Leverage AI for Efficiency: Use an LLM-as-a-Judge approach to scale qualitative evaluations effectively. Some tools even offer features that let you use AI to generate datasets and evaluation prompts, saving time and resources.
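To make that concrete, below is a minimal LLM-as-a-Judge sketch. It assumes the OpenAI Python SDK with an `OPENAI_API_KEY` set; the judge model, rubric, and score scale are illustrative placeholders rather than recommendations.

```python
# Minimal LLM-as-a-Judge sketch. Assumes the OpenAI Python SDK; the rubric,
# judge model, and 1-5 scale are illustrative placeholders.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading an AI assistant's answer to a user question.

Question: {question}
Answer: {answer}

Score the answer from 1 (poor) to 5 (excellent) for factual accuracy and helpfulness.
Respond with JSON: {{"score": <int>, "reason": "<one sentence>"}}"""

def judge(question: str, answer: str) -> dict:
    """Ask a judge model to grade one question/answer pair."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model works here
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

print(judge("What is the capital of France?", "Paris is the capital of France."))
```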
The Landscape of AI Evaluation Tools
There’s a HUGE number of LLM eval tools, so I will focus on the ones with the highest adoption and the most convincing offerings.
Most tools are built to support a straightforward workflow that looks something like the below (a minimal code sketch follows the list):
- Define a prompt.
- Define a test dataset.
- Define an evaluation metric.
- Run an experiment: run the prompt over the dataset and score the outputs with your eval metric.
- Log production runs with metadata like prompt versions, latency, and cost.
- Analyze results and identify areas for improvement.
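Stripped of any particular vendor's SDK, that loop can be as small as the sketch below; the prompt, dataset, metric, and `call_llm` stub are all hypothetical placeholders, not any specific tool's API.

```python
# Framework-agnostic sketch of the basic eval loop. Everything here is an
# invented placeholder meant to show the shape of the workflow.
from statistics import mean

PROMPT = "Summarize the following support ticket in one sentence:\n\n{ticket}"

DATASET = [
    {"ticket": "My March invoice was charged twice.", "expected_topic": "billing"},
    {"ticket": "The app crashes whenever I upload a PDF.", "expected_topic": "crash"},
]

def call_llm(prompt: str) -> str:
    """Stand-in for your real model call (OpenAI, Anthropic, a local model, ...)."""
    return "Customer reports a billing issue: a duplicate charge on the March invoice."

def topic_mentioned(output: str, expected_topic: str) -> float:
    """A trivial code-based eval metric: does the summary mention the expected topic?"""
    return 1.0 if expected_topic in output.lower() else 0.0

def run_experiment() -> dict:
    """Run the prompt over the dataset and aggregate the metric scores."""
    scores = [
        topic_mentioned(call_llm(PROMPT.format(ticket=ex["ticket"])), ex["expected_topic"])
        for ex in DATASET
    ]
    return {"examples": len(scores), "mean_score": mean(scores)}

if __name__ == "__main__":
    print(run_experiment())
```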
We can roughly segment eval tools along three axes. Many top tools are expanding beyond their original focus, but it still helps to understand where each tool got its start:
- Dev vs. GUI tools: Many tools target developers, with prompts and evaluations defined in code. Others target a less technical audience, exposing prompts and evaluations through a user interface.
- Development vs. deployment tools: Another divide is between tools designed for development evaluations and those for production evaluations. Development eval tools help you develop prompts with evaluations before deployment. They are less useful for developers who can build this workflow directly in code. Production eval tools let you evaluate prompts after deployment, which is useful for both developers and non-technical users.
- Enterprise vs. startup tools: Finally, some tools are built for enterprise clients with role-based access control, audit logs, etc., whereas others target smaller teams and solo developers who crave flexibility, e.g. with a focus on open-source libraries and documentation.
Key Considerations before Adopting an LLM Evaluation Tool
When selecting between options, here are some of the most important factors to consider.
User interface: The best tools offer intuitive interfaces accessible to both developers and non-technical team members. An efficient prompt playground allows for quick iterations, enabling non-developers to compare prompts side-by-side, view inputs and outputs in various formats (such as markdown or audio), and compute evaluations directly within the playground. On the other hand, a flexible SDK/API with good documentation lets developers integrate the tool into their own workflows and customize it as needed.
Customization: The more advanced your LLM app, the more flexibility you'll want. Beyond the table stakes of prompts with variable substitution, you might also want prompts with if/then logic, tool use such as RAG or API calls, and multimodal input/output. Aside from prompts, you may also want evals defined as prompts or code, online evals run on your code, alerting and graphing capabilities, etc. Most successful teams eventually need a custom eval pipeline (e.g. 30+ eval dimensions are not uncommon for clinical aspects in healthcare applications). Customizability also lets you adopt new foundation model releases without waiting for the tool provider to update their product for your use cases.
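As a hedged illustration of "evals defined as code", the sketch below registers eval dimensions as plain Python functions; the clinical-sounding checks are invented examples, not real criteria.

```python
# Sketch of code-defined eval dimensions; the checks below are invented placeholders
# showing the shape of a registry that can grow to 30+ domain-specific dimensions.
from typing import Callable, Dict

EvalFn = Callable[[str], float]  # model output -> score in [0, 1]

def avoids_dosage_advice(output: str) -> float:
    """Penalize outputs that appear to give concrete medication dosages."""
    return 0.0 if "mg" in output.lower() else 1.0

def recommends_clinician_followup(output: str) -> float:
    """Reward outputs that point the user to a clinician."""
    return 1.0 if "doctor" in output.lower() or "clinician" in output.lower() else 0.0

EVAL_DIMENSIONS: Dict[str, EvalFn] = {
    "avoids_dosage_advice": avoids_dosage_advice,
    "recommends_clinician_followup": recommends_clinician_followup,
    # ...real pipelines often grow to dozens of dimensions, including LLM-judged ones
}

def score_output(output: str) -> Dict[str, float]:
    """Score a single model output on every registered dimension."""
    return {name: fn(output) for name, fn in EVAL_DIMENSIONS.items()}

print(score_output("Please talk to your doctor before changing any medication."))
```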
Scalability: Your tooling should handle large volumes of data without significant performance degradation or increased latency. This often motivates buying a tool instead of building internally. At the same time, as usage grows, the pricing model should remain sustainable, especially when logging and tracing are involved.
Integration: Smooth integration with existing development workflows and rapid adoption of new LLM features are important. Your tool should offer straightforward mechanisms for logging model interactions and tracing outputs back to their origins, ensuring compatibility with current systems. Flexibility in choosing different LLM judges is also crucial. This lets you experiment with various evaluators, including fine-tuned ones, and helps prevent self-bias in your evals.
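For a vendor-neutral picture of what logging and tracing involves, here is a minimal sketch of a decorator that captures per-call metadata; the field names and the print-based sink are assumptions you would replace with your chosen tool's SDK.

```python
# Minimal, vendor-neutral tracing sketch: wrap a model call and capture the metadata
# (run id, prompt version, latency, inputs/outputs) you would forward to your eval tool.
import functools
import json
import time
import uuid

def traced(prompt_version: str):
    """Decorator that logs one record per model call; swap print() for your tool's SDK."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            output = fn(*args, **kwargs)
            record = {
                "run_id": str(uuid.uuid4()),
                "prompt_version": prompt_version,
                "latency_s": round(time.perf_counter() - start, 3),
                "inputs": {"args": list(args), "kwargs": kwargs},
                "output": output,
            }
            print(json.dumps(record))  # stand-in for sending the trace to your platform
            return output
        return wrapper
    return decorator

@traced(prompt_version="summarize-v3")
def summarize(ticket: str) -> str:
    return "stub summary"  # stand-in for the real LLM call

summarize("My March invoice was charged twice.")
```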
Community and Support: An active community on Discord, GitHub, etc. can be a valuable resource for shared knowledge and troubleshooting. Comprehensive documentation with clear guides and examples facilitates smooth onboarding and effective use of the tool's features. Enterprise solutions may lack collaborative communities but provide robust enterprise support.
Below is a comparison of three popular LLM evaluation tools, highlighting their strengths and limitations.
TL;DR
🦜🔗 LangSmith:
- Pros: Highly flexible with code-first prompt and eval definitions, robust prompt playground, extensive evaluation metric support.
- Cons: Expensive pricing for high-volume tracing, no support for audio, limited tool use flexibility.
🗨️ Braintrust:
- Pros: Excellent UI for non-coders, robust prompt management, flexible evaluation options, cost-effective for larger volumes.
- Cons: Limited support for complex tool integrations and audio, some eval UI limitations.
🪢 Langfuse:
- Pros: Cost-effective logging, flexible prompt definitions via code, supports function calls locally, can be hosted on-premise.
- Cons: Limited prompt playground features, limited multimodal support, fewer UI-based prompt management features.
LLM Evaluation Platform Feature Comparison
When to Build a Custom AI Evaluation Pipeline
Existing tools may not meet unique requirements such as real-time audio processing or advanced tool usage in complex agent workflows. If, for example, you need human-readable formats or the ability to view screenshots of the agent’s interactions, you might be better off building a custom evaluation pipeline, or running two separate evaluation systems, e.g. using Langfuse’s model-based evaluations feature for production monitoring while doing your development evaluations in a separate custom workflow. Doing so might also help you avoid the high fees associated with some of the more expensive tools.
Key Takeaways
Many tools share similar "table stakes" workflows, but differ in depth based on their initial target audience. When choosing a tool, it helps to understand which audience you care most about: developers vs. non-technical users, development vs. production evals, enterprise vs. open-source.
Different tools offer vastly different levels of customizability. If your application needs advanced features like multimodal inputs or complex agent workflows, existing tools may not suffice. Be ready to build custom solutions or extend what's available.
Define evaluation criteria based on real data. Evaluating LLM output is hard and often a bottleneck for teams scaling AI products. A common mistake is using off-the-shelf evaluation metrics without looking at actual outputs. This leads to irrelevant criteria, wasted effort, and frustration.
Don't ignore production evaluations. Pre-deployment tests aren't enough. You need to monitor your models in production to catch issues that only emerge under real-world conditions. This is often when it starts becoming useful to adopt a third party tool.