Filter experiments in the UI

LangSmith lets you filter your previous experiments by feedback scores and metadata to make it easy to find only the experiments you care about.

Background: add metadata to your experiments

When you run an experiment in the SDK, you can attach metadata to make it easier to filter in the UI. This is helpful if you already know which axes you want to drill down into when running experiments.

In our example, we attach metadata to each experiment recording the model used, the model provider, and a known ID for the prompt:

from langchain import hub
from langchain_anthropic import ChatAnthropic
from langchain_openai import ChatOpenAI
from langsmith.evaluation import evaluate

# Candidate models and system prompts to compare across experiments.
models = {
    "openai-gpt-4o": ChatOpenAI(model="gpt-4o", temperature=0),
    "openai-gpt-3.5-turbo": ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    "anthropic-claude-3-sonnet-20240229": ChatAnthropic(temperature=0, model_name="claude-3-sonnet-20240229"),
}
prompts = {
    "singleminded": "always answer questions with the word banana.",
    "fruitminded": "always discuss fruit in your answers.",
    "basic": "you are a chatbot.",
}


# LLM-as-judge evaluator that grades each answer against the reference answer.
def answer_evaluator(run, example) -> dict:
    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    answer_grader = hub.pull("langchain-ai/rag-answer-vs-reference") | llm

    score = answer_grader.invoke(
        {
            "question": example.inputs["question"],
            "correct_answer": example.outputs["answer"],
            "student_answer": run.outputs,
        }
    )
    return {"key": "correctness", "score": score["Score"]}


dataset_name = "Filterable Dataset"

# Run one experiment per (model, prompt) combination.
for model_type, model in models.items():
    for prompt_type, prompt in prompts.items():

        def predict(example):
            return model.invoke(
                [("system", prompt), ("user", example["question"])]
            )

        model_provider = model_type.split("-")[0]
        model_name = model_type[len(model_provider) + 1:]

        evaluate(
            predict,
            data=dataset_name,
            evaluators=[answer_evaluator],
            # ADD IN METADATA HERE!!
            metadata={
                "model_provider": model_provider,
                "model_name": model_name,
                "prompt_id": prompt_type,
            },
        )

Filter experiments in the UI

By default, the UI shows every experiment that has been run.

If, say, we prefer OpenAI models, we can filter down and look at scores for just the OpenAI experiments first:

We can stack filters, for example also filtering out experiments with low correctness scores so that we only compare relevant experiments:

Finally, we can clear and reset filters. For example, if we see there's a clear winner with the singleminded prompt, we can change the filter settings to check whether any other model providers' models work as well with it:
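The metadata attached above can also be used to slice experiments from the SDK. The snippet below is a minimal sketch, not the documented filtering API: it assumes that Client.list_projects accepts a reference_dataset_name filter for the experiments linked to our dataset, and that the metadata passed to evaluate() is stored under each project's extra["metadata"] (the exact field layout may vary by SDK version). It filters client-side for the OpenAI experiments:

from langsmith import Client

client = Client()

# Each experiment is stored as a project that references the dataset.
experiments = client.list_projects(reference_dataset_name="Filterable Dataset")

# Keep only experiments whose attached metadata marks them as OpenAI runs.
# NOTE: assumes evaluate() stores its metadata under project.extra["metadata"].
openai_experiments = [
    project
    for project in experiments
    if ((project.extra or {}).get("metadata") or {}).get("model_provider") == "openai"
]

for project in openai_experiments:
    print(project.name)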

