Evaluation how-to guides
These guides answer "How do I…?" questions. They are goal-oriented and concrete: each one is meant to help you complete a specific task. For conceptual explanations, see the Conceptual guide. For end-to-end walkthroughs, see the Tutorials. For comprehensive descriptions of every class and function, see the API reference.
Offline evaluation
Evaluate and improve your application before deploying it.
Run an evaluation
- Run an evaluation with the SDK
- Run an evaluation asynchronously
- Run an evaluation comparing two experiments
- Evaluate a `langchain` runnable
- Evaluate a `langgraph` graph
- Evaluate an existing experiment (Python only)
- Run an evaluation from the UI
- Run an evaluation via the REST API
- Run an evaluation with large file inputs
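As a rough orientation for the guides above, a minimal SDK evaluation looks roughly like the sketch below. It assumes a recent version of the `langsmith` Python SDK; the target function, evaluator, and dataset name (`my-dataset`) are placeholders, not part of any specific guide.

```python
from langsmith.evaluation import evaluate

# Placeholder application under test: receives an example's inputs and
# returns the outputs to be scored.
def my_app(inputs: dict) -> dict:
    return {"answer": f"Echo: {inputs['question']}"}

# Placeholder evaluator: compares the run's output to the reference output
# stored on the dataset example (assumes both contain an "answer" field).
def exact_match(run, example):
    predicted = run.outputs["answer"]
    expected = example.outputs["answer"]
    return {"key": "exact_match", "score": int(predicted == expected)}

results = evaluate(
    my_app,                           # target to evaluate
    data="my-dataset",                # assumed existing dataset name
    evaluators=[exact_match],
    experiment_prefix="sdk-quickstart",
)
```

The how-to guides above cover the variations on this pattern: running it asynchronously, comparing two experiments, evaluating `langchain` runnables and `langgraph` graphs directly, and driving evaluations from the UI or REST API instead of the SDK.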
Define an evaluator
- Define a custom evaluator
- Define an LLM-as-a-judge evaluator
- Define a pairwise evaluator
- Define a summary evaluator
- Use an off-the-shelf evaluator via the SDK (Python only)
- Evaluate an application's intermediate steps
- Return multiple metrics in one evaluator
- Return categorical vs numerical metrics
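To give a flavor of the evaluator guides, a custom evaluator is a function that receives the run and the reference example and returns one or more metrics. The sketch below uses made-up metric names and assumes the run outputs contain an `answer` field; it shows a single numerical metric and an evaluator that returns multiple metrics, one numerical and one categorical.

```python
# Single numerical metric: one score per run.
def answer_length(run, example):
    return {"key": "answer_length", "score": len(run.outputs.get("answer", ""))}

# Multiple metrics from one evaluator: return a "results" list.
def tone_and_brevity(run, example):
    answer = run.outputs.get("answer", "")
    return {
        "results": [
            # Numerical metric (boolean expressed as 0/1).
            {"key": "is_brief", "score": int(len(answer) < 200)},
            # Categorical metric: use "value" instead of "score".
            {"key": "tone", "value": "formal" if "Dear" in answer else "casual"},
        ]
    }
```

Pairwise and summary evaluators follow the same idea but receive, respectively, two runs to compare or the full list of runs and examples for an experiment; see the dedicated guides above.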
Configure the evaluation data
Configure an evaluation job
- Evaluate with repetitions
- Handle model rate limits
- Print detailed logs (Python only)
- Run an evaluation locally (beta, Python only)
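For context on the job-level options, repetitions and concurrency are passed directly to the evaluation call. This is a hedged sketch assuming a recent `langsmith` Python SDK; the target, evaluator, and dataset name are placeholders.

```python
from langsmith.evaluation import evaluate

def my_app(inputs: dict) -> dict:
    # Placeholder target; see the earlier sketch.
    return {"answer": inputs["question"]}

def exact_match(run, example):
    return {"key": "exact_match", "score": int(run.outputs["answer"] == example.outputs["answer"])}

results = evaluate(
    my_app,
    data="my-dataset",        # assumed existing dataset
    evaluators=[exact_match],
    num_repetitions=3,        # score each example multiple times to smooth out nondeterminism
    max_concurrency=2,        # throttle parallel calls to stay under model rate limits
)
```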
Testing integrations
Run evals using your favorite testing tools.
Online evaluation
Evaluate and monitor your system's live performance on production data.
Automatic evaluation
Set up evaluators that automatically run for all experiments against a dataset.
Analyzing experiment results
Use the UI & API to understand your experiment results.
- Compare experiments with the comparison view
- Filter experiments
- View pairwise experiments
- Fetch experiment results in the SDK
- Upload experiments run outside of LangSmith with the REST API
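As a small illustration of fetching results programmatically (assuming a recent `langsmith` Python SDK; the experiment name is a placeholder), each experiment is stored as a project of traces whose runs and feedback you can list:

```python
from langsmith import Client

client = Client()

# "my-experiment-1234" stands in for the experiment name shown in the UI.
runs = list(client.list_runs(project_name="my-experiment-1234", is_root=True))
for run in runs:
    print(run.inputs, run.outputs)

# Evaluator scores are attached to those runs as feedback.
for fb in client.list_feedback(run_ids=[r.id for r in runs]):
    print(fb.key, fb.score)
```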
Dataset management
Manage the LangSmith datasets used by your evaluations.
- Create a dataset from the UI
- Export a dataset from the UI
- Create a dataset split from the UI
- Filter examples from the UI
- Create a dataset with the SDK
- Fetch a dataset with the SDK
- Update a dataset with the SDK
- Version a dataset
- Share/unshare a dataset publicly
- Export filtered traces from an experiment to a dataset
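For the SDK-based dataset guides, the basic lifecycle looks roughly like the sketch below. It assumes a recent `langsmith` Python SDK; the dataset name and example contents are placeholders.

```python
from langsmith import Client

client = Client()

# Create a dataset and add a couple of examples.
dataset = client.create_dataset(dataset_name="my-dataset", description="QA examples")
client.create_examples(
    inputs=[{"question": "What is LangSmith?"}, {"question": "What is an evaluator?"}],
    outputs=[{"answer": "An observability and evals platform."}, {"answer": "A function that scores runs."}],
    dataset_id=dataset.id,
)

# Fetch the examples back.
for example in client.list_examples(dataset_name="my-dataset"):
    print(example.inputs, example.outputs)
```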
Annotation queues and human feedback
Collect feedback from subject matter experts and users to improve your applications.
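Beyond the annotation queue UI, human feedback can also be attached to runs programmatically. A hedged sketch, assuming a recent `langsmith` Python SDK; the run ID, feedback key, and comment are placeholders:

```python
from langsmith import Client

client = Client()

# Record a reviewer's judgment on an existing run.
client.create_feedback(
    run_id="your-run-id",        # placeholder: the UUID of the run being reviewed
    key="user_helpfulness",      # placeholder feedback key
    score=1,
    comment="Reviewer confirmed the answer was correct.",
)
```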