
Add Metrics to Existing Tests


At times, you may want to apply an evaluator post-hoc, after a test has already run. This is useful if you have a new evaluator (or a new version of an evaluator) and want to add its metrics without re-running your model.

You can do this like so:

from langsmith.beta import compute_test_metrics

def my_evaluator(run, example):
    score = "foo" in run.outputs["output"]
    return {"key": "is_foo", "score": score}

# The name of the test you have already run.
# This is DISTINCT from the dataset name
test_project = "test-abc123"
compute_test_metrics(test_project, evaluators=[my_evaluator])

Within the compute_test_metrics function, we list the runs in the test and apply the provided evaluators to each one.
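Conceptually, this is roughly equivalent to the simplified sketch below. The apply_evaluators helper and the exact client calls are illustrative only, not the helper's actual implementation:

from langsmith import Client

client = Client()


def apply_evaluators(test_project: str, evaluators: list):
    # Fetch the root runs recorded in the test project
    for run in client.list_runs(project_name=test_project, is_root=True):
        # Look up the dataset example each run was evaluated against
        example = client.read_example(run.reference_example_id)
        for evaluator in evaluators:
            result = evaluator(run, example)
            # Log the evaluator's score as feedback on the run
            client.create_feedback(run.id, key=result["key"], score=result["score"])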

Below, we will share a quick example.

Prerequisites

Install the requisite packages and generate the initial test results. In practice, you will already have a dataset and test results.

This utility function expects langsmith>=0.1.31.
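If you're not sure which version you have installed, you can check it before proceeding (a minimal sketch):

import langsmith

# compute_test_metrics requires langsmith >= 0.1.31
print(langsmith.__version__)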

# %pip install -U langsmith langchain
import os
import uuid

os.environ["LANGCHAIN_API_KEY"] = "YOUR API KEY"
os.environ["LANGCHAIN_TRACING_V2"] = "true"
# Update if you are self-hosted
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
from langsmith import Client

client = Client()
dataset_name = "My Example Dataset " + uuid.uuid4().hex[:6]

ds = client.create_dataset(dataset_name=dataset_name)
client.create_examples(
    inputs=[{"input": i} for i in range(10)],
    outputs=[{"output": i * (3 % (i + 1))} for i in range(10)],
    dataset_id=ds.id,
)


def my_chain(example_input: dict):
    # The input to the llm_or_chain_factory is
    # the example.inputs
    return {"output": example_input["input"] * 3}


results = client.run_on_dataset(
    dataset_name=dataset_name, llm_or_chain_factory=my_chain
)

test_name = results["project_name"]
View the evaluation results for project 'puzzled-cloud-96' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/cbdb128b-a725-4662-a515-dfe0009cb15c/compare?selectedSessions=28f2c88e-3091-4fcc-bac7-c1dbd8a6a43b

View all tests for Dataset My Example Dataset 512ee7 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/cbdb128b-a725-4662-a515-dfe0009cb15c
[------------------------------------------------->] 10/10

Add Evaluation Metrics

Now that we have existing test results, we can apply new evaluators to this project using the compute_test_metrics utility function.

from langsmith.beta._evals import compute_test_metrics
from langsmith.schemas import Example, Run


def exact_match(run: Run, example: Example):
    # "output" is the key we assigned in the create_examples step above
    expected = example.outputs["output"]
    predicted = run.outputs["output"]
    return {"key": "exact_match", "score": predicted == expected}


# The name of the test you have already run.
# This is DISTINCT from the dataset name
compute_test_metrics(test_name, evaluators=[exact_match])
/var/folders/gf/6rnp_mbx5914kx7qmmh7xzmw0000gn/T/ipykernel_80329/988510393.py:14: UserWarning: Function compute_test_metrics is in beta.
compute_test_metrics(test_name, evaluators=[exact_match])

Now you can check out the updated test results at the link above.
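If you'd rather verify the new metrics programmatically, you can read the feedback back from the test project. A minimal sketch, reusing the client, test_name, and the exact_match key from above:

# Collect the root runs from the test project and read back their feedback
runs = client.list_runs(project_name=test_name, is_root=True)
run_ids = [run.id for run in runs]

for feedback in client.list_feedback(run_ids=run_ids, feedback_key=["exact_match"]):
    print(feedback.key, feedback.score)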

Conclusion

Congrats! You've applied new evaluators to an existing test. This makes it easy to backfill evaluation metrics onto older test results.

