evaluate
langsmith.evaluation._runner.evaluate(
    target: TARGET_T | Runnable | EXPERIMENT_T | tuple[EXPERIMENT_T, EXPERIMENT_T],
    /,
    data: DATA_T | None = None,
    evaluators: Sequence[EVALUATOR_T] | Sequence[COMPARATIVE_EVALUATOR_T] | None = None,
    summary_evaluators: Sequence[SUMMARY_EVALUATOR_T] | None = None,
    metadata: dict | None = None,
    experiment_prefix: str | None = None,
    description: str | None = None,
    max_concurrency: int | None = 0,
    num_repetitions: int = 1,
    client: langsmith.Client | None = None,
    blocking: bool = True,
    experiment: EXPERIMENT_T | None = None,
    upload_results: bool = True,
    error_handling: Literal['log', 'ignore'] = 'log',
    **kwargs: Any,
)
Evaluate a target system on a given dataset.
Parameters:
target (TARGET_T | Runnable | EXPERIMENT_T | Tuple[EXPERIMENT_T, EXPERIMENT_T]) – The target system or experiment(s) to evaluate. Can be a function that takes a dict and returns a dict, a langchain Runnable, an existing experiment ID, or a two-tuple of experiment IDs.
data (DATA_T | None) – The dataset to evaluate on. Can be a dataset name, a list of examples, or a generator of examples. Defaults to None.
evaluators (Sequence[EVALUATOR_T] | Sequence[COMPARATIVE_EVALUATOR_T] | None) – A list of evaluators to run on each example. The evaluator signature depends on the target type. Defaults to None.
summary_evaluators (Sequence[SUMMARY_EVALUATOR_T] | None) – A list of summary evaluators to run on the entire dataset. Should not be specified if comparing two existing experiments. Defaults to None.
metadata (dict | None) – Metadata to attach to the experiment. Defaults to None.
experiment_prefix (str | None) – A prefix to provide for your experiment name. Defaults to None.
description (str | None) – A free-form text description for the experiment.
max_concurrency (int | None) – The maximum number of concurrent evaluations to run. If None, no limit is set; if 0, evaluations run sequentially (no concurrency). Defaults to 0.
client (langsmith.Client | None) – The LangSmith client to use. Defaults to None.
blocking (bool) – Whether to block until the evaluation is complete. Defaults to True.
num_repetitions (int) – The number of times to run the evaluation. Each item in the dataset will be run and evaluated this many times. Defaults to 1.
experiment (schemas.TracerSession | None) – An existing experiment to extend. If provided, experiment_prefix is ignored. For advanced usage only. Should not be specified if target is an existing experiment or a two-tuple of experiments.
load_nested (bool) – Whether to load all child runs for the experiment. Default is to only load the top-level root runs. Should only be specified when target is an existing experiment or two-tuple of experiments.
randomize_order (bool) – Whether to randomize the order of the outputs for each evaluation. Default is False. Should only be specified when target is a two-tuple of existing experiments.
error_handling (str, default="log") – How to handle individual run errors. 'log' will trace the runs with the error message as part of the experiment; 'ignore' will not count the run as part of the experiment at all.
upload_results (bool) – Whether to upload the experiment results to LangSmith. Defaults to True.
kwargs (Any) – Additional keyword arguments, such as load_nested and randomize_order when target is an existing experiment or two-tuple of experiments.
Returns:
ExperimentResults: If target is a function, Runnable, or existing experiment. ComparativeExperimentResults: If target is a two-tuple of existing experiments.
Return type:
ExperimentResults | ComparativeExperimentResults
Examples
Prepare the dataset:
>>> from typing import Sequence
>>> from langsmith import Client
>>> from langsmith.evaluation import evaluate
>>> from langsmith.schemas import Example, Run
>>> client = Client()
>>> dataset = client.clone_public_dataset(
...     "https://smith.langchain.com/public/419dcab2-1d66-4b94-8901-0357ead390df/d"
... )
>>> dataset_name = "Evaluate Examples"
Basic usage:
>>> def accuracy(run: Run, example: Example):
...     # Row-level evaluator for accuracy.
...     pred = run.outputs["output"]
...     expected = example.outputs["answer"]
...     return {"score": expected.lower() == pred.lower()}
>>> def precision(runs: Sequence[Run], examples: Sequence[Example]):
...     # Experiment-level evaluator for precision.
...     # TP / (TP + FP)
...     predictions = [run.outputs["output"].lower() for run in runs]
...     expected = [example.outputs["answer"].lower() for example in examples]
...     # yes and no are the only possible answers
...     tp = sum([p == e for p, e in zip(predictions, expected) if p == "yes"])
...     fp = sum([p == "yes" and e == "no" for p, e in zip(predictions, expected)])
...     return {"score": tp / (tp + fp)}
>>> def predict(inputs: dict) -> dict:
...     # This can be any function or just an API call to your app.
...     return {"output": "Yes"}
>>> results = evaluate(
...     predict,
...     data=dataset_name,
...     evaluators=[accuracy],
...     summary_evaluators=[precision],
...     experiment_prefix="My Experiment",
...     description="Evaluating the accuracy of a simple prediction model.",
...     metadata={
...         "my-prompt-version": "abcd-1234",
...     },
... )
View the evaluation results for experiment:...
Evaluating over only a subset of the examples:
>>> experiment_name = results.experiment_name
>>> examples = client.list_examples(dataset_name=dataset_name, limit=5)
>>> results = evaluate(
...     predict,
...     data=examples,
...     evaluators=[accuracy],
...     summary_evaluators=[precision],
...     experiment_prefix="My Experiment",
...     description="Just testing a subset synchronously.",
... )
View the evaluation results for experiment:...
Streaming each prediction to debug more easily and eagerly:
>>> results = evaluate(
...     predict,
...     data=dataset_name,
...     evaluators=[accuracy],
...     summary_evaluators=[precision],
...     description="I don't even have to block!",
...     blocking=False,
... )
View the evaluation results for experiment:...
>>> for i, result in enumerate(results):
...     pass
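Repeating each example and tolerating occasional failures (a minimal sketch reusing the predict, accuracy, and precision functions above; the repetition count and description are illustrative):
>>> results = evaluate(
...     predict,
...     data=dataset_name,
...     evaluators=[accuracy],
...     summary_evaluators=[precision],
...     num_repetitions=3,  # run and evaluate every example three times
...     error_handling="ignore",  # errored runs are not counted as part of the experiment
...     description="Repeat each example; skip runs that error.",
... )
View the evaluation results for experiment:...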
Using the evaluate API with an off-the-shelf LangChain evaluator:
>>> from langsmith.evaluation import LangChainStringEvaluator
>>> from langchain_openai import ChatOpenAI
>>> def prepare_criteria_data(run: Run, example: Example):
...     return {
...         "prediction": run.outputs["output"],
...         "reference": example.outputs["answer"],
...         "input": str(example.inputs),
...     }
>>> results = evaluate(
...     predict,
...     data=dataset_name,
...     evaluators=[
...         accuracy,
...         LangChainStringEvaluator("embedding_distance"),
...         LangChainStringEvaluator(
...             "labeled_criteria",
...             config={
...                 "criteria": {
...                     "usefulness": "The prediction is useful if it is correct"
...                     " and/or asks a useful followup question."
...                 },
...                 "llm": ChatOpenAI(model="gpt-4o"),
...             },
...             prepare_data=prepare_criteria_data,
...         ),
...     ],
...     description="Evaluating with off-the-shelf LangChain evaluators.",
...     summary_evaluators=[precision],
... )
View the evaluation results for experiment:...
Evaluating a LangChain object:
>>> from langchain_core.runnables import chain as as_runnable
>>> @as_runnable
... def nested_predict(inputs):
...     return {"output": "Yes"}
>>> @as_runnable
... def lc_predict(inputs):
...     return nested_predict.invoke(inputs)
>>> results = evaluate(
...     lc_predict.invoke,
...     data=dataset_name,
...     evaluators=[accuracy],
...     description="This time we're evaluating a LangChain object.",
...     summary_evaluators=[precision],
... )
View the evaluation results for experiment:...
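Comparing two existing experiments by passing a two-tuple as the target (a hedged sketch: the experiment names are hypothetical, and the comparative evaluator's return shape, a dict with a "key" and per-run "scores", is an assumption rather than behavior documented on this page):
>>> def preferred_answer(runs: Sequence[Run], example: Example):
...     # Comparative evaluator: receives one run per experiment for the same example.
...     # NOTE: the {"key": ..., "scores": {run_id: score}} return shape is assumed here.
...     expected = example.outputs["answer"].lower()
...     return {
...         "key": "preferred_answer",
...         "scores": {
...             run.id: int(run.outputs["output"].lower() == expected) for run in runs
...         },
...     }
>>> results = evaluate(
...     ("My Experiment A", "My Experiment B"),  # hypothetical existing experiment names
...     evaluators=[preferred_answer],
...     randomize_order=True,  # shuffle output order per evaluation (comparative only)
... )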
Changed in version 0.2.0: ‘max_concurrency’ default updated from None (no limit on concurrency) to 0 (no concurrency at all).
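To keep the pre-0.2.0 concurrent behavior, pass max_concurrency explicitly (a small sketch reusing predict and accuracy from above; the value 4 is arbitrary):
>>> results = evaluate(
...     predict,
...     data=dataset_name,
...     evaluators=[accuracy],
...     max_concurrency=4,  # up to 4 concurrent evaluations; None removes the limit
... )
View the evaluation results for experiment:...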