
Evaluating Existing Runs


This tutorial shows how to evaluate runs in an existing test project. This is useful when:

  • You have a new evaluator or version of an evaluator and want to add the eval metrics to existing test projects
  • Your model isn't defined in Python or TypeScript, but you still want to add evaluation metrics to its runs

The steps are:

  1. Select the test project you wish to evaluate

  2. Define the RunEvaluator

  3. Call the client.evaluate_run method, which runs the evaluation and logs the results as feedback.

    • Alternatively, call the client.create_feedback method directly, since evaluation results are logged as model feedback (see the sketch below)
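
For reference, logging feedback directly looks like the following. This is a minimal sketch; the run ID, feedback key, score, and comment are all placeholders for your own values.

from langsmith import Client

client = Client()

# Log a feedback score to an existing run by ID (all values below are placeholders).
client.create_feedback(
    run_id="YOUR RUN ID",
    key="correctness",
    score=1.0,
    comment="Manually verified the answer.",
)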

This is all you need to start logging eval feedback to an existing project. Below, we will review how to list the runs to evaluate.

1. Select Test Project to Evaluate

Each time you call run_on_dataset to evaluate a model, a new "test project" is created containing the model's runs and the evaluator feedback. Each run contains the inputs and outputs to the component as well as a reference to the dataset example (row) it came from. The test project URL and name are printed to stdout when the function is called.
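
For reference, a test project like this is typically created by a call along the following lines. This is a minimal sketch; the dataset name, chain factory, and evaluator config are placeholders you would replace with your own.

from langchain.smith import RunEvalConfig, run_on_dataset
from langsmith import Client

client = Client()

def chain_factory():
    # Placeholder: construct and return your chain or model here.
    ...

results = run_on_dataset(
    client=client,
    dataset_name="Chat Langchain Questions",  # the dataset to test against
    llm_or_chain_factory=chain_factory,
    evaluation=RunEvalConfig(evaluators=["cot_qa"]),  # optional evaluators to run alongside
)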

The easiest way to find the test project name or ID is in the web app. Navigate to the "Datasets & Testing" page, select a dataset, and then copy one of the project names from the test runs table. Below is an example of the Datasets & Testing page, with all the datasets listed. We will select the "Chat Langchain Questions" dataset.

Datasets & Testing Page

Once you've selected one of the datasets, a list of test projects will be displayed. You can copy the project name from the table directly.

Test Projects

Or, if you navigate to the test page, you can copy the project name from the title or the project ID from the URL.

Test Page

Then once you have the project name or ID, you can list the runs to evaluate by calling list_runs.

import os

os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"  # Update with your API URL if using a hosted instance of LangSmith.
os.environ["LANGCHAIN_API_KEY"] = "YOUR API KEY"  # Update with your API key
project_name = "YOUR PROJECT NAME"  # Update with your project name

from langsmith import Client

client = Client()

# Copy the project name or ID and paste it in the corresponding field below
runs = client.list_runs(
    project_name=project_name,
    # Or by ID
    # project_id="0fc4f999-bdd3-4a7e-b2d7-bdf837d57cd9",
    execution_order=1,
)

Since this is a test project, each run will have a reference to the dataset example, meaning you can apply a labeled evaluator such as the cot_qa evaluator to these runs.
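
For instance, here is a minimal sketch (assuming the project has at least one run) of looking up the dataset example that a test run references:

# Materialize one run from the generator and fetch its reference example.
first_run = next(iter(client.list_runs(project_name=project_name, execution_order=1)))
example = client.read_example(first_run.reference_example_id)
print(example.inputs)   # the inputs the run was tested on
print(example.outputs)  # the reference outputs (labels) for those inputs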

2. Define Evaluator

You may already know what you want to test to ensure your application is functioning as expected. In that case, you can easily add that logic to a custom evaluator to get started. You can also configure one of LangChain's off-the-shelf evaluators to test for things like correctness, helpfulness, embedding or string distance, or other metrics. For more information on the existing open source evaluators, check out the documentation.
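
For example, one of the off-the-shelf string evaluators can be tried on its own before it is wrapped in a run evaluator later in this tutorial. This is a minimal sketch; the strings are made up, and the string_distance evaluator requires the rapidfuzz package.

from langchain.evaluation import load_evaluator

string_distance = load_evaluator("string_distance")
result = string_distance.evaluate_strings(
    prediction="You can use the LangSmith SDK from Python or TypeScript.",
    reference="The LangSmith SDK supports both Python and TypeScript.",
)
print(result)  # a dict with a "score" entry; lower scores indicate more similar strings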

Custom Evaluator

You can add automated/algorithmic feedback to existing runs using just the SDK in two steps:

  1. Subclass the RunEvaluator class and implement the evaluate_run method
  2. Call the evaluate_run method directly on the client

The evaluate_run method loads a reference example if present, applies the evaluator to the run and optional example, and then logs the feedback to LangSmith. Below, we create a custom evaluator that checks for any digits in the prediction.

from typing import Optional

from langsmith.evaluation import EvaluationResult, RunEvaluator
from langsmith.schemas import Example, Run


class ContainsDigits(RunEvaluator):
    def evaluate_run(
        self, run: Run, example: Optional[Example] = None
    ) -> EvaluationResult:
        if run.outputs is None:
            raise ValueError("Run outputs cannot be None")
        prediction = str(next(iter(run.outputs.values())))
        contains_digits = any(c.isdigit() for c in prediction)
        print(f"Evaluating run: {run.id}")
        return EvaluationResult(key="Contains Digits", score=contains_digits)

Our custom evaluator is a simple reference-free check for the boolean presence of digits in the output. In your case, you may want to check for PII, assert that the result conforms to some schema, or even parse and compare generated code.

The logic fetching the prediction above assumes your chain returns only one value, meaning the run.outputs dictionary will have a single key. If your outputs have multiple keys, you will have to select whichever key(s) you wish to evaluate, or test the whole outputs dictionary directly as a string, as sketched below. For more information on creating a custom evaluator, check out the docs.
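
For example, here is a minimal sketch of selecting a specific output key, or falling back to the serialized outputs dictionary; the "answer" key is hypothetical and should be replaced with whatever your chain returns.

import json

def get_prediction(run: Run) -> str:
    outputs = run.outputs or {}
    # Hypothetical key: replace "answer" with whichever output key your chain returns.
    if "answer" in outputs:
        return str(outputs["answer"])
    # Otherwise, evaluate the whole outputs dictionary as a string.
    return json.dumps(outputs)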

Below, we apply the evaluator to the first few runs in the test project.

import itertools

evaluator = ContainsDigits()
runs = client.list_runs(
    project_name=project_name,
    execution_order=1,
    error=False,
)

for run in itertools.islice(runs, 5):
    feedback = client.evaluate_run(run, evaluator)
Evaluating run: bdd21374-8b04-4af1-b052-cd780274d8a4
Evaluating run: 77b6121e-3f5e-4e4c-a054-dfd2f51a2f8b
Evaluating run: f3f5eee4-6e9e-494b-b975-13edbcd79730
Evaluating run: 62ffea84-49ff-464f-8a65-fa4ecfd3c02e
Evaluating run: b33c3adc-aa51-4259-a0b0-1f8b52c8c135

The evaluation results will all be saved as feedback to the run trace. LangSmith aggregates the feedback over the project for you asynchronously, so after some time you will be able to see the feedback results directly on the project stats.

# Updating the aggregate stats is async, but after some time, the "Contains Digits" feedback will be available
client.read_project(project_name=project_name).feedback_stats
{'smog_index': {'n': 198, 'avg': 0.0, 'mode': 0, 'is_all_model': True},
 'user_click': {'n': 13, 'avg': 1.0, 'mode': 1, 'is_all_model': False},
 'completeness': {'n': 275,
  'avg': 0.851063829787234,
  'mode': 1,
  'is_all_model': True},
 'user_feedback': {'n': 6, 'avg': 0.75, 'mode': 0.5, 'is_all_model': False},
 'Contains Digits': {'n': 945,
  'avg': 0.638095238095238,
  'mode': 1,
  'is_all_model': True},
 'sufficient_code': {'n': 220, 'avg': 1.0, 'mode': 1, 'is_all_model': True},
 'coleman_liau_index': {'n': 198,
  'avg': -0.30404040404040406,
  'mode': 11.15,
  'is_all_model': True},
 'flesch_reading_ease': {'n': 198,
  'avg': 82.80222222222223,
  'mode': 59.97,
  'is_all_model': True},
 'flesch_kincaid_grade': {'n': 198,
  'avg': 2.436868686868687,
  'mode': 5.6,
  'is_all_model': True},
 'automated_readability_index': {'n': 198,
  'avg': 4.785858585858586,
  'mode': 9.9,
  'is_all_model': True}}
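
If you want to inspect individual feedback entries rather than the aggregate stats, you can also read them back per run. A minimal sketch:

# Read back the feedback logged for a single run in the project.
run = next(iter(client.list_runs(project_name=project_name, execution_order=1)))
for feedback in client.list_feedback(run_ids=[run.id]):
    print(feedback.key, feedback.score)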

LangChain evaluators

LangChain has a number of evaluators you can use off-the-shelf or modify to suit your needs. An easy way to use these is to modify the code above and apply the evaluator directly to the run. For more information on available LangChain evaluators, check out the open source documentation.

Below, we will demonstrate this using the criteria evaluator, which instructs an LLM to check the prediction against the described criteria. In this case, we will check that responses contain both a Python and a TypeScript example when code is needed, since LangSmith's SDK supports both languages.

from langchain import evaluation, callbacks


class SufficientCodeEvaluator(RunEvaluator):
    def __init__(self):
        criteria_description = (
            "If the submission contains code, does it contain both a python and typescript example?"
            " Y if no code is needed or if both languages are present, N if response is only in one language"
        )
        self.evaluator = evaluation.load_evaluator(
            "criteria",
            criteria={"sufficient_code": criteria_description},
        )

    def evaluate_run(
        self, run: Run, example: Optional[Example] = None
    ) -> EvaluationResult:
        question = next(iter(run.inputs.values()))
        prediction = str(next(iter(run.outputs.values())))
        print(f"Evaluating run: {run.id}")
        with callbacks.collect_runs() as cb:
            result = self.evaluator.evaluate_strings(input=question, prediction=prediction)
            run_id = cb.traced_runs[0].id
        return EvaluationResult(
            key="sufficient_code", evaluator_info={"__run": {"run_id": run_id}}, **result
        )


runs = client.list_runs(
    project_name=project_name,
    execution_order=1,
    error=False,
)
evaluator = SufficientCodeEvaluator()
for run in itertools.islice(runs, 5):
    feedback = client.evaluate_run(run, evaluator)
Evaluating run: bdd21374-8b04-4af1-b052-cd780274d8a4
Evaluating run: 77b6121e-3f5e-4e4c-a054-dfd2f51a2f8b
Evaluating run: f3f5eee4-6e9e-494b-b975-13edbcd79730
Evaluating run: 62ffea84-49ff-464f-8a65-fa4ecfd3c02e
Evaluating run: b33c3adc-aa51-4259-a0b0-1f8b52c8c135

Conclusion

Congrats! You've run evals on an existing test project and logged feedback to the traces. Now, all the feedback results are aggregated on the project page, and you can use those to compare prompts and chains on a dataset.

If you have other related questions, feel free to create an issue in this repo!