evaluate_existing
langsmith.evaluation._runner.evaluate_existing(
    experiment: str | UUID | TracerSession,
    /,
    evaluators: Sequence[RunEvaluator | Callable[[Run, Example | None], EvaluationResult | EvaluationResults] | Callable[..., dict | EvaluationResults | EvaluationResult]] | None = None,
    summary_evaluators: Sequence[Callable[[Sequence[Run], Sequence[Example]], EvaluationResult | EvaluationResults] | Callable[[list[Run], list[Example]], EvaluationResult | EvaluationResults]] | None = None,
    metadata: dict | None = None,
    max_concurrency: int | None = 0,
    client: Client | None = None,
    load_nested: bool = False,
    blocking: bool = True,
) -> ExperimentResults
Evaluate existing experiment runs.
Parameters:
experiment (Union[str, uuid.UUID, TracerSession]) – The identifier of the experiment to evaluate.
evaluators (Optional[Sequence[EVALUATOR_T]]) – Optional sequence of evaluators to use for individual run evaluation.
summary_evaluators (Optional[Sequence[SUMMARY_EVALUATOR_T]]) – Optional sequence of evaluators to apply over the entire dataset.
metadata (Optional[dict]) – Optional metadata to include in the evaluation results.
max_concurrency (Optional[int]) – The maximum number of concurrent evaluations to run. If None, no limit is set; if 0, evaluations run sequentially (no concurrency). Defaults to 0.
client (Optional[langsmith.Client]) – Optional LangSmith client to use for evaluation.
load_nested (bool) – Whether to load all child runs for the experiment. The default is to load only the top-level root runs.
blocking (bool) – Whether to block until evaluation is complete. Defaults to True.
Returns:
The evaluation results.
Return type:
ExperimentResults
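In the simplest case, the call reduces to an experiment identifier plus whichever evaluator sequences you need. The following is a minimal sketch, not taken from this page: the experiment name and UUID are placeholders, and exact_match is an illustrative evaluator written for this sketch.

import uuid
from langsmith.evaluation import evaluate_existing

def exact_match(run, example):
    # Illustrative row-level evaluator: score 1.0 when the prediction
    # matches the reference answer exactly.
    return {"score": run.outputs["output"] == example.outputs["answer"]}

# By experiment name, applying only row-level evaluators.
results = evaluate_existing("my-experiment-name", evaluators=[exact_match])

# By UUID, with no concurrency limit and child runs loaded as well.
results = evaluate_existing(
    uuid.UUID("00000000-0000-0000-0000-000000000000"),  # placeholder experiment ID
    evaluators=[exact_match],
    max_concurrency=None,  # None removes the limit; the default 0 runs evaluations sequentially
    load_nested=True,      # load child runs, not just the top-level root runs
)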
Environment:
LANGSMITH_TEST_CACHE: If set, API calls will be cached to disk to save time and cost during testing. Recommended to commit the cache files to your repository for faster CI/CD runs. Requires the "langsmith[vcr]" package to be installed.
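As a sketch of enabling the cache in a test suite, assuming LANGSMITH_TEST_CACHE accepts a path to a cache directory (the "tests/cassettes" path here is an arbitrary choice, not documented above):

import os

# Assumption: LANGSMITH_TEST_CACHE points at a directory where cached
# API calls are written; commit that directory for faster CI/CD runs.
os.environ["LANGSMITH_TEST_CACHE"] = "tests/cassettes"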
Examples
Define your evaluators
>>> from typing import Sequence
>>> from langsmith.schemas import Example, Run
>>> def accuracy(run: Run, example: Example):
...     # Row-level evaluator for accuracy.
...     pred = run.outputs["output"]
...     expected = example.outputs["answer"]
...     return {"score": expected.lower() == pred.lower()}
>>> def precision(runs: Sequence[Run], examples: Sequence[Example]):
...     # Experiment-level evaluator for precision.
...     # TP / (TP + FP)
...     predictions = [run.outputs["output"].lower() for run in runs]
...     expected = [example.outputs["answer"].lower() for example in examples]
...     # yes and no are the only possible answers
...     tp = sum([p == e for p, e in zip(predictions, expected) if p == "yes"])
...     fp = sum([p == "yes" and e == "no" for p, e in zip(predictions, expected)])
...     return {"score": tp / (tp + fp)}
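Per the signature above, evaluators may also return EvaluationResult (or EvaluationResults) instead of a plain dict, which lets you set the feedback key and attach a comment explicitly. A sketch, assuming the same run and example schema as the accuracy evaluator above:

from langsmith.evaluation import EvaluationResult
from langsmith.schemas import Example, Run

def accuracy_with_comment(run: Run, example: Example) -> EvaluationResult:
    # Same accuracy check as above, returned as an EvaluationResult so the
    # feedback key and an explanatory comment are set explicitly.
    pred = run.outputs["output"]
    expected = example.outputs["answer"]
    return EvaluationResult(
        key="accuracy",
        score=float(expected.lower() == pred.lower()),
        comment=f"predicted {pred!r}, expected {expected!r}",
    )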
Load the experiment and run the evaluation.
>>> import uuid
>>> from langsmith import Client
>>> from langsmith.evaluation import evaluate, evaluate_existing
>>> client = Client()
>>> dataset_name = "__doctest_evaluate_existing_" + uuid.uuid4().hex[:8]
>>> dataset = client.create_dataset(dataset_name)
>>> example = client.create_example(
...     inputs={"question": "What is 2+2?"},
...     outputs={"answer": "4"},
...     dataset_id=dataset.id,
... )
>>> def predict(inputs: dict) -> dict:
...     return {"output": "4"}
>>> # First run inference on the dataset
... results = evaluate(
...     predict, data=dataset_name, experiment_prefix="doctest_experiment"
... )
View the evaluation results for experiment:...
>>> experiment_id = results.experiment_name
>>> # Wait for the experiment to be fully processed and check if we have results
>>> len(results) > 0
True
>>> import time
>>> time.sleep(2)
>>> results = evaluate_existing(
...     experiment_id,
...     evaluators=[accuracy],
...     summary_evaluators=[precision],
... )
View the evaluation results for experiment:...
>>> client.delete_dataset(dataset_id=dataset.id)
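If you do not want the call to block, the same invocation can be made with blocking=False. The sketch below reuses the names from the example above and assumes the returned ExperimentResults object exposes wait() and yields one row per run/example pair when iterated; both are assumptions, not taken from this page.

results = evaluate_existing(
    experiment_id,
    evaluators=[accuracy],
    summary_evaluators=[precision],
    blocking=False,  # return immediately; evaluation continues in the background
)
results.wait()  # assumed helper: block until all pending evaluations finish
for row in results:  # assumed iteration: each row pairs a run with its example and feedback
    print(row["run"].id, row["evaluation_results"])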