
Evaluating Agents' Intermediate Steps


In many scenarios, evaluating an agent isn't merely about the final outcome, but about understanding the steps it took to arrive at that outcome. This notebook provides an introductory walkthrough on configuring an evaluator to assess an agent's "decision-making process" by scoring the sequence of tools it selects.

Example Agent Trace

We'll create a custom run evaluator that captures and compares the intermediate steps of the agent against a pre-defined sequence. This ensures that the agent isn't just providing the correct answers but is also being efficient about how it is using external resources.

The basic steps are:

  • Prepare a dataset with input queries and expected agent actions
  • Define the agent with specific tools and behavior
  • Construct custom evaluators that check the actions taken
  • Run the evaluation

Once the evaluation is completed, you can review the results in LangSmith. By the end of this guide, you'll have a better sense of how to apply an evaluator to more complex inputs like an agent's trajectory.

%pip install -U langchain langchain-openai langsmith duckduckgo-search
import os

os.environ["LANGCHAIN_API_KEY"] = "YOUR API KEY"
os.environ["OPENAI_API_KEY"] = "YOUR OPENAI API KEY"
# Update with your API URL if using a hosted instance of LangSmith.
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"

1. Prepare dataset

Define a new dataset. At minimum, the dataset should have input queries the agent is tasked to solve. We will also store expected steps in our dataset to demonstrate the sequence of actions the agent is expected to take in order to resolve the query.

Optionally, you can store reference labels to evaluate the agent's "correctness" in an end-to-end fashion.

import uuid

from langsmith import Client

client = Client()

questions = [
    (
        "Why was a $10 calculator app one of the best-rated Nintendo Switch games?",
        {
            "reference": "It became an internet meme due to its high price point.",
            "expected_steps": ["duck_duck_go"],
        },
    ),
    (
        "hi",
        {
            "reference": "Hello, how can I assist you?",
            "expected_steps": [],  # Expect a direct response
        },
    ),
    (
        "Who is Dejan Trajkov?",
        {
            "reference": "Macedonian Professor, Immunologist and Physician",
            "expected_steps": ["duck_duck_go"],
        },
    ),
    (
        "Who won the 2023 U23 world wrestling champs (men's freestyle 92 kg)?",
        {
            "reference": "Muhammed Gimri from Turkey",
            "expected_steps": ["duck_duck_go"],
        },
    ),
    (
        "What's my first meeting on Friday?",
        {
            "reference": 'Your first meeting is 8:30 AM for "Team Standup"',
            "expected_steps": ["check_calendar"],  # Only expect the calendar tool
        },
    ),
]

uid = uuid.uuid4()
dataset_name = f"Agent Eval Example {uid}"
ds = client.create_dataset(
    dataset_name=dataset_name,
    description="An example agent evals dataset using search and calendar checks.",
)
client.create_examples(
    inputs=[{"question": q[0]} for q in questions],
    outputs=[q[1] for q in questions],
    dataset_id=ds.id,
)
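
If you want to verify what was uploaded, you can list the examples back out of the dataset. A quick optional check using the same client:

# Optional sanity check: confirm the examples landed in the dataset.
for ex in client.list_examples(dataset_id=ds.id):
    print(ex.inputs["question"], "->", ex.outputs["expected_steps"])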

2. Define agent

The main components of an agentic program are:

  • The agent (or runnable) that accepts the query and any intermediate steps, then responds with the next action to take
  • The tools the agent has access to
  • The executor, which controls the looping behavior when choosing subsequent actions

In this example, we will create an agent with access to a DuckDuckGo search client (for informational search) and a mock tool to check a user's calendar for a given date.

Our agent will use OpenAI tool calling to ensure it generates arguments that conform to each tool's expected input schema.

from dateutil.parser import parse
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.tools import DuckDuckGoSearchResults, tool
from langchain_openai import ChatOpenAI


@tool
def check_calendar(date: str) -> list:
    """Check the user's calendar for meetings on the specified date (in ISO format)."""
    date_time = parse(date)
    # A placeholder to demonstrate working with multiple tools.
    # It's easy to mock tools when testing.
    if date_time.weekday() == 4:
        return [
            "8:30 : Team Standup",
            "9:00 : 1 on 1",
            "9:45 : Design review",
        ]
    return ["Focus time"]  # If only...


def agent(inputs: dict):
    llm = ChatOpenAI(
        model="gpt-3.5-turbo-16k",
        temperature=0,
    )
    tools = [
        DuckDuckGoSearchResults(
            name="duck_duck_go"
        ),  # General internet search using DuckDuckGo
        check_calendar,
    ]
    prompt = ChatPromptTemplate.from_messages(
        [
            ("system", "You are a helpful assistant."),
            MessagesPlaceholder(variable_name="agent_scratchpad"),
            ("user", "{question}"),
        ]
    )
    runnable_agent = create_openai_tools_agent(llm, tools, prompt)

    executor = AgentExecutor(
        agent=runnable_agent,
        tools=tools,
        handle_parsing_errors=True,
        return_intermediate_steps=True,
    )
    return executor.invoke(inputs)
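
Before running the full evaluation, it can help to invoke the agent once and confirm the response contains both the final answer and the intermediate steps the evaluator will rely on. A quick smoke test (the exact answer will vary between runs):

# Smoke test: the result should include "output" and "intermediate_steps".
result = agent({"question": "What's my first meeting on Friday?"})
print(result["output"])
print([action.tool for action, _ in result["intermediate_steps"]])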

3. Define evaluators

We will create a custom run evaluator to check the agent trajectory. It compares the run's intermediate steps against the "ground truth" we saved in the dataset above.

Review the code below. Note that this evaluator expects the agent's response to contain the "intermediate_steps" key containing the list of agent actions. This is done by setting return_intermediate_steps=True above.

This also expects each example in your dataset to have an "expected_steps" key in its outputs, as configured above.

from typing import Optional

from langsmith.schemas import Example, Run


def intermediate_step_correctness(run: Run, example: Optional[Example] = None) -> dict:
    if run.outputs is None:
        raise ValueError("Run outputs cannot be None")
    # This is the output of each run
    intermediate_steps = run.outputs.get("intermediate_steps") or []
    # Intermediate steps is a list of Tuple[AgentAction, Any]:
    # the first element is the action taken,
    # the second element is the observation from taking that action.
    # We only need the tool names to compare trajectories.
    trajectory = [action.tool for action, _ in intermediate_steps]
    # This is what we uploaded to the dataset
    expected_trajectory = example.outputs["expected_steps"]
    # Score based on an exact match of the tool sequence
    score = int(trajectory == expected_trajectory)
    return {"key": "Intermediate steps correctness", "score": score}

4. Evaluate

Add your custom evaluator to the evaluators list passed to evaluate in the code below.

Since our dataset has multiple output keys, we have to tell the QA evaluator which values to use as the prediction and the ground-truth reference. We do this with the prepare_data function passed to LangChainStringEvaluator below.

from langsmith.evaluation import LangChainStringEvaluator, evaluate


# We now need to specify this because we have multiple outputs in our dataset
def prepare_data(run: Run, example: Example) -> dict:
    return {
        "input": example.inputs["question"],
        "prediction": run.outputs["output"],
        "reference": example.outputs["reference"],
    }


# Measures whether a QA response is "Correct", based on a reference answer
qa_evaluator = LangChainStringEvaluator("qa", prepare_data=prepare_data)
chain_results = evaluate(
    agent,
    data=dataset_name,
    evaluators=[intermediate_step_correctness, qa_evaluator],
    experiment_prefix="Agent Eval Example",
    max_concurrency=1,
)

Note: the DuckDuckGo search tool may be rate limited during the run, surfacing errors such as "DuckDuckGoSearchException: Ratelimit" for some examples.
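
Once the run completes, you can pull the experiment results into a dataframe for a quick local look at per-example scores (this assumes a recent langsmith version that exposes to_pandas and that pandas is installed); the full comparison view lives in the LangSmith UI:

# Inspect per-example feedback locally.
df = chain_results.to_pandas()
print(df.head())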

Conclusion

Congratulations! You've successfully performed a simple evaluation of the agent's trajectory by comparing it to an expected sequence of actions. This is useful when you know the expected steps in advance.

Once you've configured a custom evaluator for this type of evaluation, it's easy to apply other techniques using off-the-shelf evaluators like LangChain's TrajectoryEvalChain, which can instruct an LLM to grade the efficacy of the agent's actions.

