How to evaluate an existing experiment (Python only)

note

Currently, evaluate_existing is only supported in the Python SDK.

If you have already run an experiment and want to add additional evaluation metrics, you can apply any evaluators to the experiment using the evaluate_existing method.

from langsmith import evaluate_existing

def always_half(run, example):
    return {"score": 0.5}

experiment_name = "my-experiment:abcd123"  # Replace with an actual experiment name or ID
evaluate_existing(experiment_name, evaluators=[always_half])

Example

Suppose you are evaluating a semantic router. You may first run an experiment:

from langsmith import evaluate

def semantic_router(inputs: dict):
    return {"class": 1}

def accuracy(run, example):
    prediction = run.outputs["class"]
    expected = example.outputs["label"]
    return {"score": prediction == expected}

results = evaluate(semantic_router, data="Router Classification Dataset", evaluators=[accuracy])
experiment_name = results.experiment_name

Later, you realize you want to add precision and recall summary metrics. The evaluate_existing method accepts the same arguments as the evaluate method, except that the target system is replaced by the experiment you wish to add metrics to. This means you can add both instance-level evaluators and aggregate summary_evaluators.

from langsmith import evaluate_existing

def precision(runs: list, examples: list):
    # Treat class 1 as the positive class.
    true_positives = sum(1 for run, example in zip(runs, examples) if run.outputs["class"] == 1 and example.outputs["label"] == 1)
    false_positives = sum(1 for run, example in zip(runs, examples) if run.outputs["class"] == 1 and example.outputs["label"] != 1)
    return {"score": true_positives / (true_positives + false_positives)}

def recall(runs: list, examples: list):
    true_positives = sum(1 for run, example in zip(runs, examples) if run.outputs["class"] == 1 and example.outputs["label"] == 1)
    false_negatives = sum(1 for run, example in zip(runs, examples) if run.outputs["class"] != 1 and example.outputs["label"] == 1)
    return {"score": true_positives / (true_positives + false_negatives)}

evaluate_existing(experiment_name, summary_evaluators=[precision, recall])

The precision and recall metrics will now be available in the LangSmith UI for the experiment_name experiment.
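If you want to backfill both kinds of metrics in a single pass, evaluate_existing accepts evaluators and summary_evaluators together. Here is a minimal sketch, reusing the evaluators defined above purely for illustration:

from langsmith import evaluate_existing

# Sketch: add a row-level evaluator and summary evaluators in one call.
# Any row-level evaluator could be passed here in place of accuracy.
evaluate_existing(
    experiment_name,
    evaluators=[accuracy],
    summary_evaluators=[precision, recall],
)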

As with the evaluate function, there is an equivalent asynchronous function, aevaluate_existing, for evaluating existing experiments from async code.
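For example, inside an async application you might await it directly. This is a minimal sketch, assuming aevaluate_existing is importable from langsmith like its synchronous counterpart and reusing the names defined above:

import asyncio

from langsmith import aevaluate_existing  # assumed importable like evaluate_existing

async def main():
    # Apply the same summary evaluators to the existing experiment asynchronously.
    await aevaluate_existing(experiment_name, summary_evaluators=[precision, recall])

asyncio.run(main())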
