LLM Evaluation Tooling - A Review

Josh
November 12, 2024

Building a great GenAI app requires generating high-quality AI responses for a large volume of custom user inputs. This means developers need a good system for running evaluations during both development and production. Here are my learnings from looking at dozens of implementations.

Clarify Evaluation Goals: Set the key metrics that align with your application's objectives. Having clear goals will guide tool selection and evaluation design. You may also want to find a “Principal Domain Expert” whose judgment is crucial for the success of your AI product.

Choose the Right Tool for Your Team: Align the tool's capabilities with your team's expertise and workflow. For developer-centric teams, code-first tools like LangSmith or Langfuse may be preferable; if you’re collaborating with non-technical subject matter experts, you may find e.g. Braintrust serves your needs better.

Leverage AI for Efficiency: Use an LLM-as-a-Judge approach to scale qualitative evaluations effectively. Some tools even offer features that let you use AI to generate datasets and evaluation prompts, saving time and resources.
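
As a concrete illustration, a minimal LLM-as-a-Judge check is just a grading prompt plus a model call. The sketch below assumes the OpenAI Python SDK and an API key in the environment; the rubric, model choice, and `judge_accuracy` helper are illustrative rather than tied to any specific eval tool.

```python
# Minimal LLM-as-a-Judge sketch. Assumes the OpenAI Python SDK and an
# OPENAI_API_KEY in the environment; rubric and helper name are illustrative.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Rate the answer's factual accuracy from 1 (wrong) to 5 (fully correct).
Respond with only the number."""

def judge_accuracy(question: str, answer: str) -> int:
    """Ask a judge model to score an answer; returns a 1-5 integer score."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model; ideally not the model being judged
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())  # assumes the judge follows the format
```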

The Landscape of AI Evaluation Tools

There’s a HUGE number of LLM eval tools - I will focus on the ones with the highest adoption and the most convincing offerings.

Most tools are built to support a straightforward workflow that looks something like the following (a minimal code sketch follows the list):

  1. Define a prompt.
  2. Define a test dataset.
  3. Define an evaluation metric.
  4. Run experiment: Run the prompt on the dataset with your eval metric.
  5. Log production runs with metadata like prompt versions, latency, and cost.
  6. Analyze results and identify areas for improvement.
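
To make that concrete, here is a minimal, tool-agnostic sketch of the six steps in plain Python; the prompt, dataset, `run_llm` stub, and keyword metric are all illustrative placeholders.

```python
# Tool-agnostic sketch of the six steps above. The prompt, dataset,
# run_llm stub, and keyword check are illustrative placeholders.
import time

PROMPT = "Summarize the following support ticket in one sentence:\n{ticket}"    # 1. prompt

dataset = [                                                                      # 2. test dataset
    {"ticket": "My invoice for March is missing.", "expected_keyword": "invoice"},
    {"ticket": "The app crashes when I upload a PDF.", "expected_keyword": "crash"},
]

def contains_keyword(output: str, expected_keyword: str) -> bool:               # 3. eval metric
    return expected_keyword.lower() in output.lower()

def run_llm(prompt: str) -> str:
    """Stand-in for a real model call (OpenAI, Anthropic, a local model, ...)."""
    return "The customer reports a missing invoice."

results = []
for example in dataset:                                                          # 4. run experiment
    start = time.time()
    output = run_llm(PROMPT.format(ticket=example["ticket"]))
    results.append({
        "input": example["ticket"],
        "output": output,
        "passed": contains_keyword(output, example["expected_keyword"]),
        "latency_s": round(time.time() - start, 2),                              # 5. metadata to log
    })

pass_rate = sum(r["passed"] for r in results) / len(results)                     # 6. analyze results
print(f"Pass rate: {pass_rate:.0%}")
```

In practice, steps 4 and 5 are where hosted tools earn their keep: they store each run with its prompt version, cost, and latency so you can compare experiments later.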

We can roughly segment eval tools along three axes. Many top tools are expanding beyond their original focus, but it still helps to understand where each tool got its start:

  • Dev vs. GUI tools: Many tools target developers, where prompts and evaluations are defined in code. Others target a less technical audience, offering prompts and evaluations through a user interface.
  • Development vs. deployment tools: Another divide is between tools designed for development evaluations and those for production evaluations. Development eval tools help you develop prompts with evaluations before deployment. They are less useful for developers who can build this workflow directly in code. Production eval tools let you evaluate prompts after deployment, which is useful for both developers and non-technical users.
  • Enterprise vs. startup tools: Finally, some tools are built for enterprise clients with role-based access control, audit logs, etc. whereas others target smaller teams and solo developers who crave flexibility, e.g. with a focus on open-source libraries and documentation.

Key Considerations before Adopting an LLM Evaluation Tool

When choosing between options, these are some of the most important factors to consider.

User interface: The best tools offer intuitive interfaces accessible to both developers and non-technical team members. An efficient prompt playground allows for quick iterations, enabling non-developers to compare prompts side-by-side, view inputs and outputs in various formats (such as markdown or audio), and compute evaluations directly within the playground. Meanwhile, a flexible SDK/API with good documentation lets developers integrate the tool and customize these workflows as needed.

Customization: The more advanced your LLM app, the more flexibility you'll want for customization. Beyond the table stakes of prompts with variable substitution, you might also want prompts with if/then logic, tool use such as RAG or API calls, and multimodal input/output. Beyond prompts, you may also want evals defined as prompts or code, online evals run on your code, alerting and graphing capabilities, and so on. Most successful teams eventually need a custom eval pipeline (30+ eval dimensions are not uncommon for clinical aspects of healthcare applications, for example). Customizability also lets you adopt new foundation model releases without waiting for the tool provider to support your use case.
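
As a rough illustration of what code-defined eval dimensions can look like, the sketch below implements a few as plain Python functions; the specific dimensions (PII leakage, JSON validity, length) are hypothetical examples, not a recommended set.

```python
# Sketch of code-defined eval dimensions; the three dimensions shown
# (PII leakage, JSON validity, length) are hypothetical examples.
import json
import re

def no_pii(output: str) -> bool:
    """Fail if the output leaks an email address or a phone-like number."""
    return not re.search(r"[\w.+-]+@[\w-]+\.[\w.]+|\+?\d[\d\s-]{8,}\d", output)

def valid_json(output: str) -> bool:
    """Pass if the output parses as JSON (useful for structured responses)."""
    try:
        json.loads(output)
        return True
    except ValueError:
        return False

def within_length(output: str, max_words: int = 120) -> bool:
    """Pass if the output stays under a word budget."""
    return len(output.split()) <= max_words

EVAL_DIMENSIONS = {"no_pii": no_pii, "valid_json": valid_json, "within_length": within_length}

def evaluate(output: str) -> dict:
    """Run every dimension against one model output; returns a name -> pass/fail map."""
    return {name: fn(output) for name, fn in EVAL_DIMENSIONS.items()}
```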

Scalability: Your tooling should handle large volumes of data without significant performance degradation or increased latency. This often motivates buying a tool instead of building internally. At the same time, as usage grows, the pricing model should remain sustainable, especially when logging and tracing are involved.

Integration: Smooth integration with existing development workflows and rapid adoption of new LLM features are important. Your tool should offer straightforward mechanisms for logging model interactions and tracing outputs back to their origins, ensuring compatibility with current systems. Flexibility in choosing different LLM judges is also crucial. This lets you experiment with various evaluators, including fine-tuned ones, and helps prevent self-bias in your evals.
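
A minimal sketch of the kind of logging hook this implies: wrap each model call so the prompt version, latency, and other metadata get recorded with a trace ID. The `traced_call` and `send_to_backend` names are assumptions, not any vendor's API; in practice you would swap in your chosen tool's SDK.

```python
# Generic tracing wrapper sketch; traced_call and send_to_backend are assumed
# names, not any particular vendor's API.
import time
import uuid

def send_to_backend(record: dict) -> None:
    """Placeholder: replace with your eval tool's SDK call or your own log store."""
    print(record)

def traced_call(run_llm, prompt: str, prompt_version: str, metadata: dict | None = None) -> str:
    """Wrap any model call so each execution is logged with enough context to trace it later."""
    trace_id = str(uuid.uuid4())
    start = time.time()
    output = run_llm(prompt)
    send_to_backend({
        "trace_id": trace_id,
        "prompt_version": prompt_version,
        "prompt": prompt,
        "output": output,
        "latency_s": round(time.time() - start, 3),
        **(metadata or {}),
    })
    return output
```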

Community and Support: An active community on Discord, GitHub, etc. can be a valuable resource for shared knowledge and troubleshooting. Comprehensive documentation with clear guides and examples facilitates smooth onboarding and effective use of the tool's features. Enterprise solutions may lack collaborative communities but provide robust enterprise support.

Below is a comparison of three popular LLM evaluation tools, highlighting their strengths and limitations.

TL;DR

🦜🔗 LangSmith:

  • Pros: Highly flexible with code-first prompt and eval definitions, robust prompt playground, extensive evaluation metric support.
  • Cons: Expensive pricing for high-volume tracing, no support for audio, limited tool use flexibility.

🗨️ Braintrust:

  • Pros: Excellent UI for non-coders, robust prompt management, flexible evaluation options, cost-effective for larger volumes.
  • Cons: Limited support for complex tool integrations and audio, some eval UI limitations.

🪢 Langfuse:

  • Pros: Cost-effective logging, flexible prompt definitions via code, supports function calls locally, can be hosted on-premise.
  • Cons: Limited prompt playground features, limited multimodal support, fewer UI-based prompt management features.

LLM Evaluation Platform Feature Comparison

1. Prompt Management 💬

| Feature | 🦜🔗 LangSmith | 🗨️ Braintrust | 🪢 Langfuse |
| --- | --- | --- | --- |
| Define prompts in code while logging traces | ✓ Supports defining prompts in code with logging traces | ✓ Prompts defined locally in TypeScript with logging through a proxy or local execution | ✓ Defines prompts locally using the Langfuse wrapper of the OpenAI SDK |
| Define prompts in UI for non-developers | ✓ Allows defining prompts in the UI | ✓ Robust UI for prompt creation and management | ✓ Provides a UI for defining prompts, more limited than Braintrust's |
| Seamless prompt playground to compare prompts side-by-side | ✓ Great prompt playground with variables and function calls; can run evals seamlessly | ✓ Incredible prompt playground allowing multiple prompts against the same dataset, with caching | ✗ Limited prompt playground; cannot run across multiple prompts or datasets seamlessly |
| Support complex tool use (e.g., RAG) with local tool calls | ✗ Does not support complex tool imports | ✗ Limited support for tool imports; uses old OpenAI API fields | ✓ Can call tools locally, offering flexibility in handling complex operations |
| Utilize new OpenAI/Claude feature releases without delay | ✓ Code-first approach allows immediate feature use | ✗ Hosted prompts may delay adoption of new features | ✓ Code-first approach allows immediate feature use |

Also compared in this category: viewing inputs/outputs in multiple formats (✓ Markdown only; audio not supported); version control with saving to a new named prompt version; comparing multiple prompts side-by-side; multimodal inputs/outputs; running prompts on manually specified inputs or datasets.

2. Dataset Management 💽

| Feature | 🦜🔗 LangSmith | 🗨️ Braintrust | 🪢 Langfuse |
| --- | --- | --- | --- |
| Define datasets with input and expected output | ✓ Supports defining datasets with inputs and outputs | ✓ Define datasets manually, via CSV/JSON upload, API, or from production logs | ✓ Supports manual entry and adding via API |

Also compared in this category: adding examples via SDK or CSV/JSON upload; adding production examples from logged traces; allowing inputs and outputs to be dictionaries for multi-span traces; using an LLM to generate dataset examples based on the original prompt and past logs.

3. Evaluation Metrics 📈

| Feature | 🦜🔗 LangSmith | 🗨️ Braintrust | 🪢 Langfuse |
| --- | --- | --- | --- |
| Support LLM-as-judge evaluation metrics | ✓ Supports LLM-based evals with high flexibility | ✓ Supports LLM-based evals and allows creating custom evaluators | |

Also compared in this category: support for custom LLM judge models; code-based evaluation metrics; running metrics against a gold standard when editing; using an LLM to suggest eval prompts based on the original prompt.

4. Experiment Management 🧪

Features compared in this category: comparing an example's outputs between different prompts (multiline/multimodal); comparing aggregated metrics (average score, score distribution, cost, latency) between prompts; retrieving experiment results locally for flexible analysis; pairwise comparison of two experiment results via LLM; caching run results to save time and cost; running multi-stage prompt chains (e.g., RAG, tool use, multi-turn chat) with evals and expected values; leaving comments on prompts, datasets, experiments, and traces; a human labeling/review UI.

5. Logging and Tracing 🐾

Features compared in this category: logging all executions of prompts (spans) or chains (traces); recording metadata for each execution, including prompt version, latency, cost, etc.; full audit history.

6. Online Evaluation and Monitoring 💻

Features compared in this category: online evaluation methods (random sampling, heuristic-based, manually selected, user-flagged); alerting on bad evals; alerting when an eval score drops below a threshold; marking examples for human labeling/review or to add to a dataset.

7. Pricing 💰

| | 🦜🔗 LangSmith | 🗨️ Braintrust | 🪢 Langfuse |
| --- | --- | --- | --- |
| Unit of measurement | Base traces: $0.50/1K; extended traces: $5/1K | Rows/week: $0/1,000 | Observations: $10/100K |
| Free tier | 1 user; 5K traces free | 1,000 rows/week; up to 5 users | 50K observations free |
| Plus tier | Up to 10 users; first 10K traces included; $39/user per month | Unlimited users; unlimited private experiments; custom pricing | Unlimited users; 100K observations included; $59/user per month |
| Enterprise tier | Custom pricing | Custom pricing | Starts at $499 |

When to Build a Custom AI Evaluation Pipeline

Existing tools may not meet unique requirements such as real-time audio processing or advanced tool use in complex agent workflows. If you need human-readable formats or the ability to view screenshots of an agent's interactions, for example, you might be better off building a custom evaluation pipeline, or running two separate evaluation systems: e.g., using Langfuse's model-based evaluations feature for production monitoring while doing your development evaluations in a separate custom workflow. Doing so can also help you avoid the high fees of some of the more expensive tools.
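
If you do go the custom route for production monitoring, the core of an online eval loop can stay small: sample a fraction of production traces and score them with a judge. The sampling rate, trace schema, and `judge` callable below are assumptions for illustration.

```python
# Sketch of a lightweight online eval loop: sample a fraction of production
# traces and score them with a judge. Sampling rate, trace fields, and the
# judge callable are assumptions.
import random

SAMPLE_RATE = 0.05  # score roughly 5% of production traffic to keep judge costs bounded

def maybe_score(trace: dict, judge) -> dict | None:
    """Score a random sample of production traces; returns a score record or None if skipped."""
    if random.random() > SAMPLE_RATE:
        return None
    score = judge(trace["input"], trace["output"])  # e.g., an LLM-as-a-Judge function
    return {"trace_id": trace["trace_id"], "score": score}
```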

Key Takeaways

Many tools share similar "table stakes" workflows, but have different levels of depth based on their initial target audience. When choosing a tool, it helps to understand which audience you care most about — developer vs non-technical, development vs production evals, enterprise vs open-source.

Different tools offer vastly different levels of customizability. If your application needs advanced features like multimodal inputs or complex agent workflows, existing tools may not suffice. Be ready to build custom solutions or extend what's available.

Define evaluation criteria based on real data. Evaluating LLM output is hard and often a bottleneck for teams scaling AI products. A common mistake is using off-the-shelf evaluation metrics without looking at actual outputs. This leads to irrelevant criteria, wasted effort, and frustration.

Don't ignore production evaluations. Pre-deployment tests aren't enough. You need to monitor your models in production to catch issues that only emerge under real-world conditions. This is often when it starts becoming useful to adopt a third-party tool.