evaluate#

langsmith.evaluation._runner.evaluate(target: TARGET_T | Runnable | EXPERIMENT_T | Tuple[EXPERIMENT_T, EXPERIMENT_T], /, data: DATA_T | None = None, evaluators: Sequence[EVALUATOR_T] | Sequence[COMPARATIVE_EVALUATOR_T] | None = None, summary_evaluators: Sequence[SUMMARY_EVALUATOR_T] | None = None, metadata: dict | None = None, experiment_prefix: str | None = None, description: str | None = None, max_concurrency: int | None = 0, num_repetitions: int = 1, client: langsmith.Client | None = None, blocking: bool = True, experiment: EXPERIMENT_T | None = None, upload_results: bool = True, **kwargs: Any) ExperimentResults | ComparativeExperimentResults[source]#

Evaluate a target system on a given dataset.

Parameters:
  • target (TARGET_T | Runnable | EXPERIMENT_T | Tuple[EXPERIMENT_T, EXPERIMENT_T]) – The target system or experiment(s) to evaluate. Can be a function that takes a dict and returns a dict, a langchain Runnable, an existing experiment ID, or a two-tuple of experiment IDs.

  • data (DATA_T) – The dataset to evaluate on. Can be a dataset name, a list of examples, or a generator of examples.

  • evaluators (Sequence[EVALUATOR_T] | Sequence[COMPARATIVE_EVALUATOR_T] | None) – A list of evaluators to run on each example. The evaluator signature depends on the target type. Default to None.

  • summary_evaluators (Sequence[SUMMARY_EVALUATOR_T] | None) – A list of summary evaluators to run on the entire dataset. Should not be specified if comparing two existing experiments. Defaults to None.

  • metadata (dict | None) – Metadata to attach to the experiment. Defaults to None.

  • experiment_prefix (str | None) – A prefix to provide for your experiment name. Defaults to None.

  • description (str | None) – A free-form text description for the experiment.

  • max_concurrency (int | None) – The maximum number of concurrent evaluations to run. If None then no limit is set. If 0 then no concurrency. Defaults to 0.

  • client (langsmith.Client | None) – The LangSmith client to use. Defaults to None.

  • blocking (bool) – Whether to block until the evaluation is complete. Defaults to True.

  • num_repetitions (int) – The number of times to run the evaluation. Each item in the dataset will be run and evaluated this many times. Defaults to 1.

  • experiment (schemas.TracerSession | None) – An existing experiment to extend. If provided, experiment_prefix is ignored. For advanced usage only. Should not be specified if target is an existing experiment or two-tuple fo experiments.

  • load_nested (bool) – Whether to load all child runs for the experiment. Default is to only load the top-level root runs. Should only be specified when target is an existing experiment or two-tuple of experiments.

  • randomize_order (bool) – Whether to randomize the order of the outputs for each evaluation. Default is False. Should only be specified when target is a two-tuple of existing experiments.

  • upload_results (bool)

  • kwargs (Any)

Returns:

If target is a function, Runnable, or existing experiment. ComparativeExperimentResults: If target is a two-tuple of existing experiments.

Return type:

ExperimentResults

Examples

Prepare the dataset:

>>> from typing import Sequence
>>> from langsmith import Client
>>> from langsmith.evaluation import evaluate
>>> from langsmith.schemas import Example, Run
>>> client = Client()
>>> dataset = client.clone_public_dataset(
...     "https://smith.langchain.com/public/419dcab2-1d66-4b94-8901-0357ead390df/d"
... )
>>> dataset_name = "Evaluate Examples"

Basic usage:

>>> def accuracy(run: Run, example: Example):
...     # Row-level evaluator for accuracy.
...     pred = run.outputs["output"]
...     expected = example.outputs["answer"]
...     return {"score": expected.lower() == pred.lower()}
>>> def precision(runs: Sequence[Run], examples: Sequence[Example]):
...     # Experiment-level evaluator for precision.
...     # TP / (TP + FP)
...     predictions = [run.outputs["output"].lower() for run in runs]
...     expected = [example.outputs["answer"].lower() for example in examples]
...     # yes and no are the only possible answers
...     tp = sum([p == e for p, e in zip(predictions, expected) if p == "yes"])
...     fp = sum([p == "yes" and e == "no" for p, e in zip(predictions, expected)])
...     return {"score": tp / (tp + fp)}
>>> def predict(inputs: dict) -> dict:
...     # This can be any function or just an API call to your app.
...     return {"output": "Yes"}
>>> results = evaluate(
...     predict,
...     data=dataset_name,
...     evaluators=[accuracy],
...     summary_evaluators=[precision],
...     experiment_prefix="My Experiment",
...     description="Evaluating the accuracy of a simple prediction model.",
...     metadata={
...         "my-prompt-version": "abcd-1234",
...     },
... )  
View the evaluation results for experiment:...

Evaluating over only a subset of the examples

>>> experiment_name = results.experiment_name
>>> examples = client.list_examples(dataset_name=dataset_name, limit=5)
>>> results = evaluate(
...     predict,
...     data=examples,
...     evaluators=[accuracy],
...     summary_evaluators=[precision],
...     experiment_prefix="My Experiment",
...     description="Just testing a subset synchronously.",
... )  
View the evaluation results for experiment:...

Streaming each prediction to more easily + eagerly debug.

>>> results = evaluate(
...     predict,
...     data=dataset_name,
...     evaluators=[accuracy],
...     summary_evaluators=[precision],
...     description="I don't even have to block!",
...     blocking=False,
... )  
View the evaluation results for experiment:...
>>> for i, result in enumerate(results):  
...     pass

Using the evaluate API with an off-the-shelf LangChain evaluator:

>>> from langsmith.evaluation import LangChainStringEvaluator
>>> from langchain_openai import ChatOpenAI
>>> def prepare_criteria_data(run: Run, example: Example):
...     return {
...         "prediction": run.outputs["output"],
...         "reference": example.outputs["answer"],
...         "input": str(example.inputs),
...     }
>>> results = evaluate(
...     predict,
...     data=dataset_name,
...     evaluators=[
...         accuracy,
...         LangChainStringEvaluator("embedding_distance"),
...         LangChainStringEvaluator(
...             "labeled_criteria",
...             config={
...                 "criteria": {
...                     "usefulness": "The prediction is useful if it is correct"
...                     " and/or asks a useful followup question."
...                 },
...                 "llm": ChatOpenAI(model="gpt-4o"),
...             },
...             prepare_data=prepare_criteria_data,
...         ),
...     ],
...     description="Evaluating with off-the-shelf LangChain evaluators.",
...     summary_evaluators=[precision],
... )  
View the evaluation results for experiment:...

Evaluating a LangChain object:

>>> from langchain_core.runnables import chain as as_runnable
>>> @as_runnable
... def nested_predict(inputs):
...     return {"output": "Yes"}
>>> @as_runnable
... def lc_predict(inputs):
...     return nested_predict.invoke(inputs)
>>> results = evaluate(
...     lc_predict.invoke,
...     data=dataset_name,
...     evaluators=[accuracy],
...     description="This time we're evaluating a LangChain object.",
...     summary_evaluators=[precision],
... )  
View the evaluation results for experiment:...

Changed in version 0.2.0: β€˜max_concurrency’ default updated from None (no limit on concurrency) to 0 (no concurrency at all).