
Testing & Evaluation Recipes

Retrieval Augmented Generation (RAG)

  • Q&A System Correctness: evaluate your retrieval-augmented Q&A pipeline end-to-end on a dataset. Iterate, improve, and keep testing.
  • Evaluating Q&A Systems with Dynamic Data: use evaluators that dereference labels at evaluation time to handle data that changes over time.
  • RAG Evaluation using Fixed Sources: evaluate the response component of a RAG (retrieval-augmented generation) pipeline by providing the retrieved documents in the dataset.
  • RAG evaluation with RAGAS: evaluate RAG pipelines using the RAGAS framework. Covers metrics for both the generator AND retriever in both labeled and reference-free contexts (answer correctness, faithfulness, context relevancy, recall and precision).
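As a taste of the RAGAS recipe above, here is a minimal sketch of scoring a single RAG example offline. The column names follow RAGAS's dataset convention (which varies slightly across RAGAS versions), the example row is a placeholder, and the default judge metrics expect an LLM API key (OpenAI by default) to be configured.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_correctness,
    context_precision,
    context_recall,
    faithfulness,
)

# One placeholder row: the question, your pipeline's answer, the retrieved
# contexts, and a reference answer for the labeled metrics.
eval_data = Dataset.from_dict({
    "question": ["What does LangSmith help with?"],
    "answer": ["LangSmith helps you trace, test, and evaluate LLM applications."],
    "contexts": [["LangSmith is a platform for tracing and evaluating LLM apps."]],
    "ground_truth": ["LangSmith helps trace, test, and evaluate LLM applications."],
})

# Generator metrics (answer_correctness, faithfulness) and retriever metrics
# (context_precision, context_recall) are scored in one pass.
scores = evaluate(
    eval_data,
    metrics=[answer_correctness, faithfulness, context_precision, context_recall],
)
print(scores)
```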

Chat Bots

  • Chat Bot Evals using Simulated Users: evaluate your chat bot using a simulated user. The user is given a task, and you score your assistant based on how well it helps without breaking its instructions.
  • Single-turn evals: Evaluate chatbots within multi-turn conversations by treating each data point as an individual dialogue turn. This guide shows how to set up a multi-turn conversation dataset and evaluate a simple chat bot on it.
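For the single-turn recipe, each dataset example can hold the conversation so far plus the next user message as inputs, and the expected assistant reply as the output. A minimal sketch using the LangSmith client; the dataset name, input/output keys, and message format are illustrative assumptions, not a required schema.

```python
from langsmith import Client

client = Client()
dataset = client.create_dataset("single-turn-chat-evals")  # hypothetical dataset name

# One example = one dialogue turn: prior messages plus the new user message in,
# the reply we expect the chat bot to produce out.
client.create_example(
    inputs={
        "chat_history": [
            {"role": "user", "content": "Hi, I need to reset my password."},
            {"role": "assistant", "content": "Sure, which account is it for?"},
        ],
        "question": "My work email account.",
    },
    outputs={"expected_response": "Send the reset link for work email accounts."},
    dataset_id=dataset.id,
)
```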

Extraction

  • Evaluating an Extraction Chain: measure the similarity between the extracted structured content and structured labels using LangChain's json evaluators.
  • Exact Match: deterministic comparison of your system output against a reference label.
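The two extraction recipes boil down to off-the-shelf string evaluators. A minimal sketch, assuming LangChain's `load_evaluator` helpers are installed (the JSON edit-distance evaluator pulls in `rapidfuzz`); the payloads are placeholders.

```python
from langchain.evaluation import load_evaluator

# Structured similarity: edit distance between the extracted JSON and the label.
json_eval = load_evaluator("json_edit_distance")
print(json_eval.evaluate_strings(
    prediction='{"name": "Ada", "role": "engineer"}',
    reference='{"name": "Ada", "role": "scientist"}',
))

# Deterministic comparison: the output must equal the reference label exactly.
exact_match = load_evaluator("exact_match")
print(exact_match.evaluate_strings(prediction="42", reference="42"))
```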

Agents

  • Evaluating an Agent's intermediate steps: compare the sequence of actions taken by an agent to an expected trajectory to grade effective tool use (see the sketch after this list).
  • Tool Selection: Evaluate the precision of selected tools. Include an automated prompt writer to improve the tool descriptions based on failure cases.
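For the trajectory recipe, a custom evaluator can simply compare the ordered list of tools the agent actually called against the expected trajectory stored on the dataset example. The keys `expected_trajectory` and `tool_calls` are assumptions about how you log your agent's steps, not a fixed LangSmith schema; a function with this `(run, example)` signature can be passed to LangSmith's evaluation helpers as a custom evaluator.

```python
from langsmith.schemas import Example, Run

def trajectory_exact_match(run: Run, example: Example) -> dict:
    """Score 1 if the agent called the expected tools in the expected order."""
    expected = example.outputs.get("expected_trajectory", [])   # assumed label key
    actual = (run.outputs or {}).get("tool_calls", [])          # assumed run output key
    return {"key": "trajectory_exact_match", "score": int(actual == expected)}
```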

Multimodal

Fundamentals

  • Adding Metrics to Existing Tests: Apply new evaluators to existing test results without re-running your model, using the compute_test_metrics utility function. This lets you evaluate "post-hoc" and backfill metrics as you define new evaluators.
  • Production Candidate Testing: benchmark new versions of your production app using real inputs. Convert production runs to a test dataset, then compare your new system's performance against the baseline.
  • Naming Test Projects: manually name your tests with run_on_dataset(..., project_name='my-project-name'); see the sketch after this list.
  • Exporting Tests to CSV: Use the get_test_results beta utility to export your test results to a CSV file, so you can analyze and report on the performance metrics, errors, runtimes, inputs, outputs, and other details of your tests outside the LangSmith platform.
  • How to download feedback and examples from a test project: goes beyond the utility described above, querying and exporting the predictions, evaluation results, and other run information so you can add them to your reports programmatically.
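Putting the "Naming Test Projects" bullet into code: a minimal sketch of running a dataset through a stand-in target function with an explicit project name. The dataset name, the stub function and its input key, and the chosen "qa" evaluator are placeholders; running it requires an existing LangSmith dataset and the relevant API keys.

```python
from langchain.smith import RunEvalConfig, run_on_dataset
from langsmith import Client

def my_app(inputs: dict) -> dict:
    # Placeholder for your real chain, agent, or model call.
    return {"output": "stub answer for " + inputs["question"]}

client = Client()
results = run_on_dataset(
    client=client,
    dataset_name="my-eval-dataset",               # assumed existing dataset
    llm_or_chain_factory=my_app,
    evaluation=RunEvalConfig(evaluators=["qa"]),  # labeled correctness evaluator
    project_name="my-project-name",               # the explicit test project name
)
```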
