Testing & Evaluation
Many challenges hinder the creation of high-quality, production-grade LLM applications, including:
- Non-deterministic outputs: models are probabilistic and can produce different outputs for the same prompt, even with temperature set to 0, since the underlying model weights are not guaranteed to stay static over time (see the sketch after this list).
- API opacity: models served behind APIs can change over time.
- Security: LLMs are vulnerable to prompt injection.
- Bias: LLMs encode biases that can create negative user experiences.
- Cost: state-of-the-art models can be expensive.
- Latency: most user-facing experiences need to be fast.
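The non-determinism above is easy to observe directly. Below is a minimal sketch that calls the same model with the same prompt several times at temperature 0 and reports any distinct outputs; it assumes the OpenAI chat integration is installed and an API key is configured, and the prompt is just an illustrative placeholder.

```python
# Minimal sketch: probing for non-deterministic outputs by repeating a call.
# Assumes `langchain` with the OpenAI integration is installed and
# OPENAI_API_KEY is set; swap in whichever model you actually use.
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0)  # even at temperature 0, outputs can drift
prompt = "Summarize the plot of Hamlet in one sentence."  # placeholder prompt

outputs = {llm.predict(prompt) for _ in range(5)}
if len(outputs) > 1:
    print(f"Got {len(outputs)} distinct outputs for the same prompt:")
    for text in outputs:
        print("-", text)
else:
    print("All 5 calls returned the same output (this run).")
```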
Testing and evaluation help expose issues, so you can decide how to best address them, be that through different design choices, better models or prompts, additional code checks, or other means.
We don't have all the answers! We provide a number of tools and reference materials to help you get started, including automated evaluators of various types that help detect issues you face when scaling up LLM applications. No single tool is perfect, and we want to encourage experimentation and innovation in this space by providing these tools and guides as a starting point. We welcome your feedback and collaboration!
Check out the Quick Start Guide for a brief walkthrough of evaluating your chain or LLM, or read on for more details.
For a higher-level set of recommendations on how to think about testing and evaluating your LLM app, check out the evaluation recommendations page.
📄️ Overview
Many challenges hinder the creation of high-quality, production-grade LLM applications.
📄️ Quick Start
In this walkthrough, you will evaluate a chain over a dataset of examples.
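As a preview of the Quick Start, here is a hedged sketch of running a chain over a LangSmith dataset with off-the-shelf evaluators; the dataset name and model factory are placeholders, and the exact `run_on_dataset` signature may differ slightly between LangChain versions.

```python
# Minimal sketch, assuming a LangSmith dataset named "my-eval-dataset"
# already exists and LANGCHAIN_API_KEY / OPENAI_API_KEY are configured.
from langchain.chat_models import ChatOpenAI
from langchain.smith import RunEvalConfig, run_on_dataset
from langsmith import Client

client = Client()
eval_config = RunEvalConfig(evaluators=["qa"])  # grade answers vs. references

run_on_dataset(
    client=client,
    dataset_name="my-eval-dataset",           # placeholder dataset name
    llm_or_chain_factory=lambda: ChatOpenAI(temperature=0),
    evaluation=eval_config,
)
```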
📄️ Datasets
Datasets are collections of examples that can be used to evaluate or otherwise improve a chain, agent, or model. Examples are rows in the dataset, containing the inputs and (optionally) the expected outputs for a given interaction. Below we will go over the current types of datasets as well as different ways to create them.
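For a concrete picture of the example structure described above, here is a minimal sketch of creating a dataset programmatically with the LangSmith client; the dataset name and the question/answer keys are arbitrary choices for illustration.

```python
# Minimal sketch, assuming LANGCHAIN_API_KEY is set for the LangSmith client.
from langsmith import Client

client = Client()
dataset = client.create_dataset(
    "toy-qa-dataset",                        # placeholder dataset name
    description="Tiny QA dataset for evaluation demos",
)

# Each example is a row: inputs plus (optionally) the expected outputs.
client.create_example(
    inputs={"question": "What is 2 + 2?"},
    outputs={"answer": "4"},                 # expected output is optional
    dataset_id=dataset.id,
)
```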
📄️ LangChain Evaluators
LangChain's evaluation module provides evaluators you can use as-is for common evaluation scenarios.
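As an illustration, the sketch below loads one of the off-the-shelf evaluators (the "qa" evaluator, which uses an LLM to grade a prediction against a reference answer); it assumes an OpenAI key is available for the grading model, and the strings are placeholders.

```python
# Minimal sketch of an off-the-shelf evaluator; assumes OPENAI_API_KEY is set
# because the "qa" evaluator uses an LLM under the hood to grade answers.
from langchain.evaluation import load_evaluator

evaluator = load_evaluator("qa")
result = evaluator.evaluate_strings(
    input="What is the capital of France?",     # the original question
    prediction="Paris is the capital of France.",
    reference="Paris",                          # ground-truth answer
)
print(result)  # e.g. {"reasoning": "...", "value": "CORRECT", "score": 1}
```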
📄️ Custom Evaluators
In this guide, you will create a custom string evaluator for your agent. You can choose to use LangChain components or write your own custom evaluator from scratch.
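A custom evaluator can be as simple as a class implementing the string-evaluator interface; below is a minimal sketch that scores predictions by length, where the class name and the 280-character budget are made up for illustration.

```python
# Minimal sketch of a custom string evaluator; the length budget is arbitrary.
from typing import Any, Optional

from langchain.evaluation import StringEvaluator


class ConcisenessEvaluator(StringEvaluator):
    """Scores 1 if the prediction is non-empty and at most 280 characters."""

    def _evaluate_strings(
        self,
        *,
        prediction: str,
        reference: Optional[str] = None,
        input: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        ok = 0 < len(prediction) <= 280
        return {"score": int(ok), "value": "PASS" if ok else "FAIL"}


print(ConcisenessEvaluator().evaluate_strings(prediction="Short and sweet."))
```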
📄️ Feedback
This guide will walk you through feedback in LangSmith. For more end-to-end examples incorporating feedback into a workflow, see the LangSmith Cookbook.
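To make the feedback flow concrete, here is a minimal sketch of attaching a feedback score to a traced run with the LangSmith client; the run id, feedback key, and comment are placeholders you would replace with values from your own runs.

```python
# Minimal sketch of logging feedback; the run id below is a placeholder and
# would normally come from a traced run (e.g. via callbacks or client.list_runs).
from langsmith import Client

client = Client()
client.create_feedback(
    run_id="00000000-0000-0000-0000-000000000000",  # placeholder run id
    key="user_rating",      # arbitrary feedback key chosen for this example
    score=1,                # e.g. thumbs-up
    comment="Helpful and accurate answer.",
)
```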
🗃️ Additional Resources