
How to download feedback and examples from a test project


When testing with LangSmith, all the traces, examples, and evaluation feedback are saved, so you have a full audit trail of what happened. You can view the aggregate metrics of a test run and compare results example by example. You can also download the run and evaluation results for use in external reporting software.

In this walkthrough, we show how to export the feedback and examples from a LangSmith test project. The main steps are:

  1. Create a dataset
  2. Run testing
  3. Export feedback and examples

Setup

Install LangChain and any other dependencies for your chain. We also install pandas in this walkthrough so that the retrieved data can be loaded into a DataFrame.

# %pip install -U langsmith langchain anthropic pandas --quiet
import os

# Update with your API URL if using a hosted instance of LangSmith.
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "YOUR API KEY" # Update with your API key
project_name = "YOUR PROJECT NAME" # Update with your project name
os.environ["LANGCHAIN_PROJECT"] = project_name # Optional: "default" is used if not set
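Before creating the client, it can help to confirm that the placeholder values above were actually replaced. The following is a minimal sketch of such a check; `missing_settings` and `PLACEHOLDERS` are hypothetical names introduced for illustration and are not part of the LangSmith SDK.

```python
import os

# Values that indicate a setting was never filled in (assumption: the
# placeholder string matches the one used in the setup cell above).
PLACEHOLDERS = {"", "YOUR API KEY"}


def missing_settings(env=os.environ):
    """Return the names of required variables that are unset or still placeholders."""
    required = ["LANGCHAIN_ENDPOINT", "LANGCHAIN_API_KEY"]
    return [name for name in required if env.get(name, "") in PLACEHOLDERS]
```

Calling `missing_settings()` after the setup cell returns an empty list when both variables are set to real values, so a non-empty result is a cue to revisit the configuration before running the test.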

1. Create a dataset

We will create a simple key-value (KV) dataset in which each example has a poem topic and a constraint letter that the model should not use.

from langsmith import Client
import uuid

client = Client()

examples = [
("roses", "o"),
("vikings", "v"),
("planet earth", "e"),
("Sirens of Titan", "t"),
]

dataset_name = f"Download Feedback and Examples {str(uuid.uuid4())}"
dataset = client.create_dataset(dataset_name)

for prompt, constraint in examples:
    client.create_example(
        {"input": prompt, "constraint": constraint},
        dataset_id=dataset.id,
        outputs={"constraint": constraint},
    )

2. Run testing

We will use a simple custom evaluator that checks whether the prediction contains the constraint letter.

from typing import Any
from langchain.evaluation import StringEvaluator


class ConstraintEvaluator(StringEvaluator):
    @property
    def requires_reference(self):
        return True

    def _evaluate_strings(self, prediction: str, reference: str, **kwargs: Any) -> dict:
        # The reference is the letter that must not appear in the prediction.
        # Note that this check is case-sensitive.
        passed = reference not in prediction
        verdict = "does not contain" if passed else "contains"
        return {
            "score": passed,
            "reasoning": f"prediction {verdict} the letter {reference}",
        }
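The check itself can be tried without LangChain or an API key. The sketch below restates the same substring test as a plain function; `check_constraint` and its `ignore_case` option are hypothetical additions for illustration. Whether uppercase occurrences (for example the "T" in "Titan") should count as violations is an assumption you would need to settle for your own use case; the evaluator above treats them as allowed.

```python
def check_constraint(prediction: str, letter: str, ignore_case: bool = False) -> dict:
    """Score a prediction: True when the forbidden letter is absent."""
    haystack, needle = prediction, letter
    if ignore_case:
        # Optional stricter mode: compare both sides in lowercase.
        haystack, needle = prediction.lower(), letter.lower()
    passed = needle not in haystack
    verdict = "does not contain" if passed else "contains"
    return {"score": passed, "reasoning": f"prediction {verdict} the letter {letter}"}
```

With the default case-sensitive behavior, `check_constraint("Tall ships", "t")` passes because only an uppercase "T" is present, while passing `ignore_case=True` makes the same text fail.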
from langchain import chat_models, prompts
from langchain.schema.output_parser import StrOutputParser

from langchain.smith import RunEvalConfig

chain = (
    prompts.PromptTemplate.from_template(
        "Write a poem about {input} without using the letter {constraint}. Respond directly with the poem with no explanation."
    )
    | chat_models.ChatAnthropic()
    | StrOutputParser()
)

eval_config = RunEvalConfig(
    custom_evaluators=[ConstraintEvaluator()],
    input_key="input",
)

test_results = client.run_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=chain,
    evaluation=eval_config,
)

View the evaluation results for project 'test-elderly-war-24' at: https://smith.langchain.com/o/9a6371ef-ea6a-4860-b3bd-9614084873e7/projects/p/029c5f34-bfeb-423f-9a2b-93780061c5c4
[------------------------------------------------->] 4/4

3. Review the feedback and examples

To use the results directly, call to_dataframe() on test_results to get them in tabular form.

test_results.to_dataframe()

| | ConstraintEvaluator | input | output | reference |
| --- | --- | --- | --- | --- |
| f9fad700-f624-4fd1-bc02-93b6c539b91f | False | {'input': 'Sirens of Titan', 'constraint': 't'} | Here is a poem about Sirens of Titan without ... | {'constraint': 't'} |
| c741b1e2-2ca1-43c4-b12e-396df95e6f7e | False | {'input': 'planet earth', 'constraint': 'e'} | Our orb spins through space so vast, \nIts la... | {'constraint': 'e'} |
| b191a1b4-3dda-4ccc-91f6-8948cbd11153 | False | {'input': 'vikings', 'constraint': 'v'} | Here is a poem about vikings without using th... | {'constraint': 'v'} |
| dbb29f2a-9d55-4b10-bf11-c01c2121935d | False | {'input': 'roses', 'constraint': 'o'} | Here is a poem about roses without using the ... | {'constraint': 'o'} |
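An aggregate metric such as the pass rate can be computed from these rows. The sketch below assumes the rows have been converted to a list of dictionaries (for example with pandas' `to_dict("records")`) and that the evaluator column is named `ConstraintEvaluator`, as in the output shown here; `pass_rate` is a hypothetical helper, not part of LangSmith.

```python
def pass_rate(records, column="ConstraintEvaluator"):
    """Return the fraction of rows whose evaluator score is truthy, or None if no scores exist."""
    scores = [bool(r[column]) for r in records if r.get(column) is not None]
    return sum(scores) / len(scores) if scores else None


# Scores copied from the four rows of the test run shown in this walkthrough.
records = [
    {"input": "Sirens of Titan", "ConstraintEvaluator": False},
    {"input": "planet earth", "ConstraintEvaluator": False},
    {"input": "vikings", "ConstraintEvaluator": False},
    {"input": "roses", "ConstraintEvaluator": False},
]
rate = pass_rate(records)
```

Rows without a score are skipped rather than counted as failures, which is a design choice you may want to reverse if missing feedback should be treated as an error.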

To fetch the feedback and examples for a previous test project, you can use the SDK:

import pandas as pd

# This can be the name of any previous test project.
test_project = test_results["project_name"]

# execution_order=1 restricts the listing to top-level runs.
runs = client.list_runs(project_name=test_project, execution_order=1)

df = pd.DataFrame(
    [
        {
            "example_id": r.reference_example_id,
            **r.inputs,
            **(r.outputs or {}),
            # Flatten each feedback item into "<key>.score" and "<key>.comment" columns.
            **{
                k: v
                for f in client.list_feedback(run_ids=[r.id])
                for k, v in [
                    (f"{f.key}.score", f.score),
                    (f"{f.key}.comment", f.comment),
                ]
            },
            "reference": client.read_example(r.reference_example_id).outputs,
        }
        for r in runs
    ]
)
df

| | example_id | input | constraint | output | ConstraintEvaluator.score | ConstraintEvaluator.comment | reference |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | dbb29f2a-9d55-4b10-bf11-c01c2121935d | roses | o | Here is a poem about roses without using the ... | 0.0 | prediction contains the letter o | {'constraint': 'o'} |
| 1 | b191a1b4-3dda-4ccc-91f6-8948cbd11153 | vikings | v | Here is a poem about vikings without using th... | 0.0 | prediction contains the letter v | {'constraint': 'v'} |
| 2 | c741b1e2-2ca1-43c4-b12e-396df95e6f7e | planet earth | e | Our orb spins through space so vast, \nIts la... | 0.0 | prediction contains the letter e | {'constraint': 'e'} |
| 3 | f9fad700-f624-4fd1-bc02-93b6c539b91f | Sirens of Titan | t | Here is a poem about Sirens of Titan without ... | 0.0 | prediction contains the letter t | {'constraint': 't'} |
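The feedback-flattening step and a CSV export for external reporting can be illustrated offline. In the sketch below, `SimpleNamespace` objects stand in for the items returned by `client.list_feedback(...)`; only the `key`, `score`, and `comment` attributes used in this walkthrough are modeled. `flatten_feedback` and `rows_to_csv` are hypothetical helpers, and the standard-library `csv` module is used here only to keep the example dependency-free.

```python
import csv
import io
from types import SimpleNamespace


def flatten_feedback(feedback_items):
    """Map each feedback item to '<key>.score' and '<key>.comment' columns."""
    row = {}
    for f in feedback_items:
        row[f"{f.key}.score"] = f.score
        row[f"{f.key}.comment"] = f.comment
    return row


def rows_to_csv(rows):
    """Serialize a list of dict rows to CSV text; columns missing from a row stay empty."""
    fieldnames = []
    for row in rows:
        for name in row:
            if name not in fieldnames:
                fieldnames.append(name)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Stand-in for one feedback item; the values are copied from the "roses" row.
feedback = [
    SimpleNamespace(
        key="ConstraintEvaluator",
        score=0.0,
        comment="prediction contains the letter o",
    )
]
row = {"input": "roses", "constraint": "o", **flatten_feedback(feedback)}
csv_text = rows_to_csv([row])
```

If you already have the DataFrame, pandas' own `df.to_csv(...)` is the more direct route; the helper above is mainly useful when the rows are plain dictionaries.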

Conclusion

In this example, we showed how to download feedback and examples from a test project. You can use the result object returned by the run directly, or use the SDK to fetch the results and feedback later. Either way, you can analyze the data further or programmatically add it to your existing reports.

