
Testing & Evaluation Recipes

Retrieval Augmented Generation (RAG)

  • Q&A System Correctness: evaluate your retrieval-augmented Q&A pipeline end-to-end on a dataset. Iterate, improve, and keep testing. (A minimal sketch follows this list.)
  • Evaluating Q&A Systems with Dynamic Data: use evaluators that dereference labels to handle data that changes over time.
  • RAG Evaluation using Fixed Sources: evaluate the response component of a RAG (retrieval-augmented generation) pipeline by providing the retrieved documents in the dataset.
  • RAG evaluation with RAGAS: evaluate RAG pipelines using the RAGAS framework. Covers metrics for both the generator AND retriever in both labeled and reference-free contexts (answer correctness, faithfulness, context relevancy, recall and precision).
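
For the Q&A System Correctness recipe, the following is a minimal sketch, assuming a LangSmith dataset named "my-qa-dataset" with a "question" input key already exists and that your API keys are set in the environment. The prompt-only chain stands in for a real retrieval-augmented pipeline; all names are illustrative.

```python
from langchain.smith import RunEvalConfig, run_on_dataset
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langsmith import Client

client = Client()

def construct_chain():
    # Stand-in for your real retrieval-augmented Q&A pipeline.
    prompt = ChatPromptTemplate.from_template("Answer the question: {question}")
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    return prompt | llm | StrOutputParser()

# "qa" is an LLM-as-judge evaluator that grades correctness against the
# dataset's reference answers.
eval_config = RunEvalConfig(evaluators=["qa"])

run_on_dataset(
    client=client,
    dataset_name="my-qa-dataset",          # hypothetical dataset name
    llm_or_chain_factory=construct_chain,  # re-built for each example
    evaluation=eval_config,
)
```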

Chat Bots

  • Chat Bot Evals using Simulated Users: evaluate your chat bot using a simulated user. The user is given a task, and you score your assistant based on how well it helps without breaking its instructions.
  • Single-turn evals: Evaluate chatbots within multi-turn conversations by treating each data point as an individual dialogue turn. This guide shows how to set up a multi-turn conversation dataset and evaluate a simple chat bot on it. (A sketch of such a dataset follows this list.)
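
For the single-turn evals recipe, here is a minimal sketch of populating such a dataset, one example per dialogue turn. The schema used here (a "messages" input plus an "expected" reference reply) and the dataset name are assumptions, not a required format.

```python
from langsmith import Client

client = Client()
dataset = client.create_dataset(dataset_name="my-chat-dataset")  # hypothetical name

# Each example is one dialogue turn: the conversation so far is the input,
# and the assistant's reference reply is the output.
client.create_example(
    inputs={
        "messages": [
            {"role": "user", "content": "Hi, I need to reset my password."},
            {"role": "assistant", "content": "Sure, which account is this for?"},
            {"role": "user", "content": "My work email account."},
        ]
    },
    outputs={"expected": "Okay, I've sent a reset link to your work email."},
    dataset_id=dataset.id,
)
```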

Extraction

  • Evaluating an Extraction Chain: measure the similarity between the extracted structured content and structured labels using LangChain's JSON evaluators.
  • Exact Match: deterministic comparison of your system output against a reference label. (A sketch of both checks follows this list.)
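
A minimal sketch of both extraction checks, using LangChain's off-the-shelf evaluators. The payloads are illustrative, and the JSON edit-distance evaluator assumes the optional rapidfuzz dependency is installed.

```python
from langchain.evaluation import load_evaluator

prediction = '{"name": "Ada Lovelace", "role": "mathematician"}'
reference = '{"name": "Ada Lovelace", "role": "mathematician"}'

# Normalized edit distance between the two JSON structures (0.0 means identical).
json_evaluator = load_evaluator("json_edit_distance")
print(json_evaluator.evaluate_strings(prediction=prediction, reference=reference))

# Deterministic pass/fail comparison against the reference label.
exact_match = load_evaluator("exact_match")
print(exact_match.evaluate_strings(prediction=prediction, reference=reference))
```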

Agents

  • Evaluating an Agent's intermediate steps: compare the sequence of actions taken by an agent to an expected trajectory to grade effective tool use. (A sketch follows this list.)
  • Tool Selection: Evaluate the precision of selected tools. Includes an automated prompt writer to improve the tool descriptions based on failure cases.
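
A minimal sketch of the trajectory comparison idea: grade the ordered list of tools the agent actually invoked against the expected trajectory stored with the example. The scoring scheme, field names, and tool names are illustrative.

```python
def trajectory_scores(actual_tools: list[str], expected_tools: list[str]) -> dict:
    """Compare an agent's tool-call sequence to an expected trajectory."""
    exact_order = actual_tools == expected_tools
    # Partial credit: fraction of expected tools that were used at all.
    used = sum(1 for tool in expected_tools if tool in actual_tools)
    recall = used / len(expected_tools) if expected_tools else 1.0
    return {"exact_trajectory_match": exact_order, "tool_recall": recall}

print(trajectory_scores(
    actual_tools=["search_docs", "calculator"],
    expected_tools=["search_docs", "calculator"],
))
# {'exact_trajectory_match': True, 'tool_recall': 1.0}
```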

Multimodal

Fundamentals

  • Adding Metrics to Existing Tests: Apply new evaluators to existing test results without re-running your model, using the compute_test_metrics utility function. This lets you evaluate "post-hoc" and backfill metrics as you define new evaluators.
  • Production Candidate Testing: benchmark new versions of your production app using real inputs. Convert production runs to a test dataset, then compare your new system's performance against the baseline.
  • Naming Test Projects: manually name your tests with run_on_dataset(..., project_name='my-project-name')
  • Exporting Tests to CSV: Use the get_test_results beta utility to easily export your test results to a CSV file. This allows you to analyze and report on the performance metrics, errors, runtime, inputs, outputs, and other details of your tests outside of the LangSmith platform. (A sketch follows this list.)
  • How to download feedback and examples from a test project: goes beyond the utility described above to query and export the predictions, evaluation results, and other information so you can programmatically add them to your reports.
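
A minimal sketch combining project naming with CSV export. It assumes the beta Client.get_test_results helper is available in your installed langsmith version and returns a pandas DataFrame; the dataset, project, and function names are illustrative.

```python
from langchain.smith import run_on_dataset
from langchain_core.runnables import RunnableLambda
from langsmith import Client

client = Client()

def my_system(inputs: dict) -> dict:
    # Stand-in for your real application; replace with your chain or agent.
    return {"output": "placeholder answer to: " + inputs.get("question", "")}

run_on_dataset(
    client=client,
    dataset_name="my-qa-dataset",                            # hypothetical dataset name
    llm_or_chain_factory=lambda: RunnableLambda(my_system),  # factory returning a runnable
    project_name="my-project-name",                          # manually chosen test project name
)

# Export predictions, feedback, and run metadata for offline analysis.
df = client.get_test_results(project_name="my-project-name")
df.to_csv("test_results.csv", index=False)
```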
