Evaluation Quick Start
In this walkthrough, you will evaluate a chain over a dataset of examples. To do so, you will:
- Create a dataset of example inputs
- Define an LLM, chain, or agent to evaluate
- Configure and run the evaluation
- Review the resulting traces and evaluation feedback in LangSmith
Prerequisites
This walkthrough assumes you have already installed LangChain with the OpenAI integration and configured your environment to connect to LangSmith.
pip install -U "langchain[openai]"
export LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
export LANGCHAIN_API_KEY=<your api key>
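Optionally, you can verify the environment is configured correctly before proceeding. The snippet below is a minimal sanity check, assuming the environment variables above are set: it instantiates the LangSmith client and lists any datasets already in your account.
from langsmith import Client

# The client reads LANGCHAIN_API_KEY and LANGCHAIN_ENDPOINT from the environment.
client = Client()
# Listing datasets (possibly empty) confirms the connection and credentials work.
print([dataset.name for dataset in client.list_datasets()])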
1. Create a dataset
Upload a dataset to LangSmith to use for evaluation. For this example, we will upload a pre-made list of input examples.
For more information on other ways to create and use datasets, check out the datasets guide.
- Python
from langsmith import Client
example_inputs = [
    "a rap battle between Atticus Finch and Cicero",
    "a rap battle between Barbie and Oppenheimer",
    "a Pythonic rap battle between two swallows: one European and one African",
    "a rap battle between Aubrey Plaza and Stephen Colbert",
]
client = Client()
dataset_name = "Rap Battle Dataset"
# Storing inputs in a dataset lets us
# run chains and LLMs over a shared set of examples.
dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Rap battle prompts.",
)
for input_prompt in example_inputs:
    # Each example must be unique and have inputs defined.
    # Outputs are optional.
    client.create_example(
        inputs={"question": input_prompt},
        outputs=None,
        dataset_id=dataset.id,
    )
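If you do have reference labels, you can attach them as outputs when creating an example; reference-based evaluators can then grade predictions against them. The call below is an illustrative sketch (not needed for this walkthrough) reusing create_example, with a hypothetical "answer" key as the reference output.
client.create_example(
    inputs={"question": "a rap battle between a tortoise and a hare"},
    # A reference output lets labeled evaluators compare against ground truth.
    outputs={"answer": "An example reference rap would go here."},
    dataset_id=dataset.id,
)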
2. Define cognitive architecture to evaluate
LangSmith can evaluate any Runnable LangChain component or any custom function over this dataset.
If your cognitive architecture uses state, such as conversational memory, you can provide a constructor function that creates a new instance of your object for each example row in the dataset. If your cognitive architecture is stateless, you can directly pass the object or function in.
Custom functions that are not LangChain components will be automatically wrapped in a RunnableLambda so that each invocation is traced.
- Runnable
- Chain or Agent
- LLM or Chat Model
- Custom function
- Custom class
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnableMap, RunnablePassthrough
# Since chains and agents can be stateful (they can have memory),
# create a constructor to pass in to the run_on_dataset method.
def create_runnable():
    llm = ChatOpenAI(temperature=0)
    prompt = ChatPromptTemplate.from_messages([
        ("human", "Spit some bars about {input}."),
    ])
    return RunnableMap({"input": RunnablePassthrough()}) | prompt | llm
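Before running the full evaluation, it can be helpful to sanity-check the constructor locally. This quick sketch (which makes one OpenAI call) simply builds a fresh runnable and invokes it on an arbitrary prompt:
# Optional local check: build a runnable and invoke it once.
chain = create_runnable()
print(chain.invoke("a rap battle between two rubber ducks"))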
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain
# Since chains and agents can be stateful (they can have memory),
# create a constructor to pass in to the run_on_dataset method.
def create_chain():
    llm = ChatOpenAI(temperature=0)
    return LLMChain.from_string(llm, "Spit some bars about {input}.")
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(temperature=0)
# You can also evaluate any arbitrary function over the dataset.
# The input to the function will be the inputs dictionary for each example.
def predict_result(input_: dict) -> dict:
    return {"output": "Bar Bar Bar"}
# If your predictor is stateful (e.g. it has memory),
# You can create a new instance of the predictor for each row in the dataset.
class MyPredictor:
    def __init__(self):
        self.state = 0

    def predict(self, input_: dict) -> dict:
        if self.state > 0:
            raise ValueError("This predictor is stateful and can only be called once.")
        self.state += 1
        return {"output": f"Bar Bar Bar {self.state}"}


def create_object():
    predictor = MyPredictor()
    # Return the method that will be called on the next row.
    return predictor.predict
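Because the evaluation harness calls the factory for each example row, every row gets a fresh MyPredictor instance and the single-use guard above never fires. A quick local sketch of that behavior:
# Each call to create_object() yields an independent predictor,
# so per-row state never leaks between examples.
predict_row_1 = create_object()
predict_row_2 = create_object()
print(predict_row_1({"question": "row 1"}))  # {'output': 'Bar Bar Bar 1'}
print(predict_row_2({"question": "row 2"}))  # {'output': 'Bar Bar Bar 1'}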
3. Evaluate
LangChain provides a convenient run_on_dataset method (and an async counterpart, arun_on_dataset) to generate predictions (and traces) over a dataset. When a RunEvalConfig is provided, the configured evaluators will also be applied to the predictions to generate automated feedback.
Below, configure evaluation for some custom criteria. The feedback will be automatically logged within LangSmith. Since the input examples we created above lack "ground truth" reference labels, we will only select reference-free "Criteria" evaluators.
For more information on evaluators you can use off-the-shelf, check out the pre-built evaluators docs or the reference documentation for LangChain's evaluation module. For more information on how to write a custom evaluator, check out the custom evaluators guide.
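As a preview of what a custom evaluator can look like (the custom evaluators guide covers this in depth), here is a minimal, illustrative sketch. It assumes the RunEvaluator interface from the langsmith SDK and simply checks that a run produced any output; an instance like this could be passed through RunEvalConfig's custom_evaluators field.
from typing import Optional

from langsmith.evaluation import EvaluationResult, RunEvaluator
from langsmith.schemas import Example, Run


class NonEmptyOutputEvaluator(RunEvaluator):
    """Illustrative reference-free evaluator: did the run produce any output?"""

    def evaluate_run(self, run: Run, example: Optional[Example] = None) -> EvaluationResult:
        produced_output = bool(run.outputs)
        return EvaluationResult(key="non_empty_output", score=int(produced_output))

# Hypothetical usage: RunEvalConfig(custom_evaluators=[NonEmptyOutputEvaluator()], ...)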
- Runnable
- Chain or Agent
- LLM or Chat Model
- Custom function
- Custom class
from langchain.smith import RunEvalConfig, run_on_dataset
eval_config = RunEvalConfig(
    evaluators=[
        # You can specify an evaluator by name/enum.
        # In this case, the default criterion is "helpfulness"
        "criteria",
        # Or you can configure the evaluator
        RunEvalConfig.Criteria("harmfulness"),
        RunEvalConfig.Criteria(
            {
                "cliche": "Are the lyrics cliche?"
                " Respond Y if they are, N if they're entirely unique."
            }
        ),
    ]
)

run_on_dataset(
    client=client,
    dataset_name=dataset_name,
    llm_or_chain_factory=create_runnable,
    evaluation=eval_config,
    verbose=True,
)
from langchain.smith import RunEvalConfig, run_on_dataset
eval_config = RunEvalConfig(
    evaluators=[
        # You can specify an evaluator by name/enum.
        # In this case, the default criterion is "helpfulness"
        "criteria",
        # Or you can configure the evaluator
        RunEvalConfig.Criteria("harmfulness"),
        RunEvalConfig.Criteria(
            {
                "cliche": "Are the lyrics cliche?"
                " Respond Y if they are, N if they're entirely unique."
            }
        ),
    ]
)

run_on_dataset(
    client=client,
    dataset_name=dataset_name,
    llm_or_chain_factory=create_chain,
    evaluation=eval_config,
    verbose=True,
    project_name="llmchain-test-1",
)
from langchain.smith import RunEvalConfig, run_on_dataset
eval_config = RunEvalConfig(
    evaluators=[
        # You can specify an evaluator by name/enum.
        # In this case, the default criterion is "helpfulness"
        "criteria",
        # Or you can configure the evaluator
        RunEvalConfig.Criteria("harmfulness"),
        RunEvalConfig.Criteria("misogyny"),
        RunEvalConfig.Criteria(
            {
                "cliche": "Are the lyrics cliche? "
                "Respond Y if they are, N if they're entirely unique."
            }
        ),
    ]
)

run_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=llm,
    evaluation=eval_config,
    client=client,
    verbose=True,
    project_name="chatopenai-test-1",
)
from langchain.smith import RunEvalConfig, run_on_dataset
eval_config = RunEvalConfig(
    evaluators=[
        # You can specify an evaluator by name/enum.
        # In this case, the default criterion is "helpfulness"
        "criteria",
        # Or you can configure the evaluator
        RunEvalConfig.Criteria("harmfulness"),
        RunEvalConfig.Criteria("misogyny"),
        RunEvalConfig.Criteria(
            {
                "cliche": "Are the lyrics cliche? "
                "Respond Y if they are, N if they're entirely unique."
            }
        ),
    ]
)

run_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=predict_result,
    evaluation=eval_config,
    client=client,
    verbose=True,
    project_name="custom-function-test-1",
)
from langchain.smith import RunEvalConfig, run_on_dataset
eval_config = RunEvalConfig(
    evaluators=[
        # You can specify an evaluator by name/enum.
        # In this case, the default criterion is "helpfulness"
        "criteria",
        # Or you can configure the evaluator
        RunEvalConfig.Criteria("harmfulness"),
        RunEvalConfig.Criteria("misogyny"),
        RunEvalConfig.Criteria(
            {
                "cliche": "Are the lyrics cliche? "
                "Respond Y if they are, N if they're entirely unique."
            }
        ),
    ]
)

run_on_dataset(
    dataset_name=dataset_name,
    # We are passing the "factory" function in this case.
    llm_or_chain_factory=create_object,
    evaluation=eval_config,
    client=client,
    verbose=True,
    project_name="custom-class-test-1",
)
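The async arun_on_dataset variant mentioned above works the same way. A minimal sketch, assuming it accepts the same arguments as run_on_dataset (the project_name below is arbitrary):
import asyncio

from langchain.smith import RunEvalConfig, arun_on_dataset


async def main():
    # Same arguments as run_on_dataset, but awaited.
    await arun_on_dataset(
        client=client,
        dataset_name=dataset_name,
        llm_or_chain_factory=create_runnable,
        evaluation=eval_config,
        verbose=True,
        project_name="async-test-1",
    )


asyncio.run(main())  # In a notebook, `await main()` instead.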
4. Review Results
The evaluation results will be streamed to a new test project linked to your "Rap Battle Dataset". You can view the results by clicking on the link printed by the run_on_dataset function, or by navigating to the Datasets & Testing page, clicking "Rap Battle Dataset", and viewing the latest test run.
There, you can inspect the traces and feedback generated from the evaluation configuration. You can click on any row to view the trace and feedback generated for that example. To view the outputs of other runs on the same example row, click "View all reference runs".
More on evaluation
Congratulations! You've now created a dataset and used it to evaluate your agent or LLM. To learn more about evaluation chains available out of the box, check out the LangChain Evaluators guide. To learn how to make your own custom evaluators, review the Custom Evaluator guide.