Evaluators

In this guide, you will create custom evaluators to grade your LLM system. An evaluator can apply any logic you want, returning a numeric score associated with a key.

Most evaluators are applied at the run level, scoring each prediction individually. Some, called summary_evaluators, can be applied at the experiment level, letting you score and aggregate metrics across multiple runs.

Evaluators take in a Run and an Example:

Run Object

The key Run object fields are as follows:

  • outputs: Dict[str, Any] - The outputs of your pipeline
  • inputs: Dict[str, Any] - The inputs to your pipeline. These are the same as the inputs in the example.
  • child_runs: List[Run] - If your pipeline has nested steps, these can be accessed and used in your evaluator

Example Object

The key Example object fields are as follows:

  • outputs: Dict[str, Any] - The reference labels or other context found in your dataset
  • inputs: Dict[str, Any] - The inputs to your pipeline
  • metadata: Dict[str, Any] - Any other metadata you have stored in that example within the dataset
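
Putting these fields together, here is a minimal sketch of an evaluator that reads from both objects. The "output" and "answer" keys, and the "difficulty" metadata field, are assumptions for illustration; substitute whatever keys your pipeline and dataset actually use.

from langsmith.schemas import Example, Run

def difficulty_aware_match(run: Run, example: Example) -> dict:
    prediction = run.outputs["output"]      # pipeline output (assumed key)
    reference = example.outputs["answer"]   # dataset label (assumed key)
    # Hypothetical metadata field attached to the example in the dataset
    difficulty = (example.metadata or {}).get("difficulty", "unknown")
    score = prediction.strip().lower() == reference.strip().lower()
    # A "comment" is optional and is surfaced alongside the score
    return {"key": "exact_match", "score": score, "comment": f"difficulty={difficulty}"}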

Run Evaluators

Run evaluators are applied to each prediction of your pipeline. The common automated evaluator types are:

  1. Simple Heuristics: Checking for regex matches, presence/absence of certain words or code, etc.
  2. AI-assisted: Instruct an "LLM-as-judge" to grade the output of a run based on the prediction and reference answer (or retrieved context).

We will demonstrate some simple ones below.
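
As a quick taste of the first category, here is a sketch of a regex-based heuristic that checks whether the prediction contains a fenced code block. The "output" key is an assumption about your pipeline's return format; adapt the pattern to whatever you need to detect.

import re

from langsmith.schemas import Example, Run

def contains_code_block(run: Run, example: Example) -> dict:
    # Heuristic check: does the prediction include a fenced code block?
    prediction = run.outputs["output"]  # assumes the pipeline returns an "output" key
    score = bool(re.search(r"```[\s\S]*?```", prediction))
    return {"key": "contains_code_block", "score": score}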

Example 1: Exact Match Evaluator

Let's start with a simple evaluator that checks if the model's output exactly matches the reference answer.

from langsmith.schemas import Example, Run

def exact_match(run: Run, example: Example) -> dict:
    reference = example.outputs["answer"]
    prediction = run.outputs["output"]
    score = prediction.lower() == reference.lower()
    return {"key": "exact_match", "score": score}

Let's break this down:

  • The evaluator function accepts a Run and an Example and returns a dictionary with the evaluation key and score. The run contains the full trace of your pipeline, and the example contains the inputs and reference outputs for this data point. If your dataset contains labels, they live in the example.outputs dictionary, which is kept separate so your pipeline never sees the labels at prediction time.
  • In our dataset, the outputs have an "answer" key that contains the reference answer, and the pipeline returns its prediction in a dictionary with an "output" key.
  • The evaluator compares the prediction and reference case-insensitively and reports the result under the "exact_match" key.

You can use this evaluator directly in the evaluate function:

from langsmith.evaluation import evaluate

evaluate(
    <your prediction function>,
    data="<dataset_name>",
    evaluators=[exact_match],
)

Example 2: Parametrizing your evaluator

You may want to parametrize your evaluator as a class. This is useful when you need to pass additional parameters to the evaluator.

from langsmith.evaluation import evaluate
from langsmith.schemas import Example, Run


class BlocklistEvaluator:
    def __init__(self, blocklist: list[str]):
        self.blocklist = blocklist

    def __call__(self, run: Run, example: Example | None = None) -> dict:
        model_outputs = run.outputs["output"]
        score = not any(word in model_outputs for word in self.blocklist)
        return {"key": "blocklist", "score": score}


evaluate(
    <your prediction function>,
    data="<dataset_name>",
    evaluators=[BlocklistEvaluator(blocklist=["bad", "words"])],
)

Example 3: Evaluating nested traces

While most evaluations are applied to the inputs and outputs of your system, you can also evaluate all of the subcomponents that are traced within your pipeline.

This is possible by stepping through the run object and collecting the outputs of each component.

As a simple example, let's say you want to check whether the expected tools were invoked in your pipeline.

from langsmith.evaluation import evaluate
from langsmith.schemas import Example, Run

def evaluate_trajectory(run: Run, example: Example) -> dict:
    # Collect the tool runs on level 1 of the trace tree
    steps = [child.name for child in run.child_runs if child.run_type == "tool"]
    expected_steps = example.outputs["expected_tools"]
    # Jaccard similarity between the invoked tools and the expected tools
    union = set(steps) | set(expected_steps)
    score = len(set(steps) & set(expected_steps)) / len(union) if union else 1.0
    return {"key": "tools", "score": score}

This lets you grade the performance of intermediate steps in your pipeline.

Note: the example above assumes tools are properly typed in the trace tree.
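
If the tool calls you care about are nested deeper than one level, one option is to walk the trace tree recursively and collect every tool run. This is a sketch under the same assumptions as above (an "expected_tools" key in the example outputs and tool runs tagged with run_type == "tool"); the helper names are illustrative.

from typing import List

from langsmith.schemas import Example, Run

def _collect_tool_names(run: Run) -> List[str]:
    # Depth-first walk over the trace tree, collecting the name of every tool run
    names = []
    for child in run.child_runs or []:
        if child.run_type == "tool":
            names.append(child.name)
        names.extend(_collect_tool_names(child))
    return names

def evaluate_trajectory_recursive(run: Run, example: Example) -> dict:
    steps = set(_collect_tool_names(run))
    expected = set(example.outputs["expected_tools"])
    union = steps | expected
    score = len(steps & expected) / len(union) if union else 1.0
    return {"key": "tools_recursive", "score": score}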

Example 4: Structured Output

With function calling, it has become easier than ever to generate feedback metrics with an LLM-as-judge: simply specify a Pydantic schema for the output you want.

Below is an example (using OpenAI's tool-calling functionality) that evaluates the faithfulness of a RAG app's response to its retrieved context.

import json
from typing import List

import openai
from langsmith.schemas import Example, Run
from pydantic import BaseModel, Field

openai_client = openai.Client()


class Propositions(BaseModel):
    propositions: List[str] = Field(
        description="The factual propositions generated by the model"
    )


class FaithfulnessScore(BaseModel):
    reasoning: str = Field(description="The reasoning for the faithfulness score")
    score: bool


def faithfulness(run: Run, example: Example) -> dict:
    # Assumes your RAG app includes the prediction in the "output" key of its response
    response: str = run.outputs["output"]
    # Assumes your RAG app includes the retrieved docs as a "context" key in the outputs.
    # If not, you can fetch them from the child_runs of the run object.
    retrieved_docs: list = run.outputs["context"]
    formatted_docs = "\n".join([str(doc) for doc in retrieved_docs])
    extracted = openai_client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {
                "role": "user",
                "content": "Extract all factual propositions from the following text:\n\n"
                f"```\n{response}\n```",
            },
        ],
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "Propositions",
                    "description": "Use to record each factual assertion.",
                    "parameters": Propositions.model_json_schema(),
                },
            }
        ],
        tool_choice={"type": "function", "function": {"name": "Propositions"}},
    )
    propositions = [
        prop
        for tc in extracted.choices[0].message.tool_calls
        for prop in json.loads(tc.function.arguments)["propositions"]
    ]
    scores, reasoning = [], []
    tools = [
        {
            "type": "function",
            "function": {
                "name": "FaithfulnessScore",
                "description": "Use to score how faithful the propositions are to the docs.",
                "parameters": FaithfulnessScore.model_json_schema(),
            },
        }
    ]
    for proposition in propositions:
        faithfulness_completion = openai_client.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=[
                {
                    "role": "user",
                    "content": "Grade whether the proposition can be logically concluded"
                    f" from the docs:\n\nProposition: {proposition}\nDocs:\n"
                    f"```\n{formatted_docs}\n```",
                },
            ],
            tools=tools,
            tool_choice={"type": "function", "function": {"name": "FaithfulnessScore"}},
        )
        faithfulness_args = json.loads(
            faithfulness_completion.choices[0].message.tool_calls[0].function.arguments
        )
        scores.append(faithfulness_args["score"])
        reasoning.append(faithfulness_args["reasoning"])
    average_score = sum(scores) / len(scores) if scores else None
    comment = "\n".join(reasoning)
    return {"key": "faithfulness", "score": average_score, "comment": comment}

Example 5: Returning Multiple Scores

A single evaluator can return multiple scores. This is useful, for example, when you use tool calling with an LLM-as-judge to extract multiple metrics in a single API call.

import json

import openai
from langsmith.schemas import Example, Run
from pydantic import BaseModel, Field

# Initialize the OpenAI client
openai_client = openai.Client()


class Scores(BaseModel):
    correctness_reasoning: str = Field(description="The reasoning for the correctness score")
    correctness: float = Field(description="The score for the correctness of the prediction")
    conciseness_reasoning: str = Field(description="The reasoning for the conciseness score")
    conciseness: float = Field(description="The score for the conciseness of the prediction")


def multiple_scores(run: Run, example: Example) -> dict:
    reference = example.outputs["answer"]
    prediction = run.outputs["output"]

    messages = [
        {
            "role": "user",
            "content": f"Reference: {reference}\nPrediction: {prediction}",
        },
    ]

    tools = [
        {
            "type": "function",
            "function": {
                "name": "Scores",
                "description": "Use to evaluate the correctness and conciseness of the prediction.",
                "parameters": Scores.model_json_schema(),
            },
        }
    ]

    # Generate the chat completion with structured output
    completion = openai_client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=messages,
        tools=tools,
        tool_choice={"type": "function", "function": {"name": "Scores"}},
    )

    # Extract the structured scores from the completion
    scores_args = json.loads(completion.choices[0].message.tool_calls[0].function.arguments)

    return {
        "results": [
            # Provide the key, score, and other relevant information for each metric
            {"key": "correctness", "score": scores_args["correctness"], "comment": scores_args["correctness_reasoning"]},
            {"key": "conciseness", "score": scores_args["conciseness"], "comment": scores_args["conciseness_reasoning"]},
        ]
    }

Example 6: Perplexity Evaluator

The flexibility of the functional interface means you can easily apply evaluators from other libraries. For instance, you may want to use a statistical measure such as perplexity to grade your run output. Below is an example using HuggingFace's evaluate package, which contains many commonly used metrics. Start by installing it with pip install evaluate.

from typing import Optional

from evaluate import load
from langsmith.schemas import Example, Run


class PerplexityEvaluator:
    def __init__(self, prediction_key: Optional[str] = None, model_id: str = "gpt2"):
        self.prediction_key = prediction_key
        self.model_id = model_id
        self.metric_fn = load("perplexity", module_type="metric")

    def evaluate_run(self, run: Run, example: Optional[Example] = None) -> dict:
        if run.outputs is None:
            raise ValueError("Run outputs cannot be None")
        prediction = run.outputs[self.prediction_key]
        results = self.metric_fn.compute(
            predictions=[prediction], model_id=self.model_id
        )
        ppl = results["perplexities"][0]
        return {"key": "Perplexity", "score": ppl}

Let's break down what the PerplexityEvaluator is doing:

  • Initialize: In the constructor, we're setting up a few properties that will be needed later on.
    • prediction_key: The key to find the model's prediction in the outputs of a run.
    • model_id: The ID of the language model used to compute the metric. In our example, we are using 'gpt2'.
    • metric_fn: The evaluation metric function, loaded from the HuggingFace evaluate package.
  • Evaluate: This method takes a run (and optionally an example) and returns an evaluation dictionary.
    • If the run outputs are None, the evaluator raises an error.
    • Otherwise, the outputs are passed to the metric_fn to compute the perplexity. The perplexity score is then returned as part of the evaluation dictionary.

Once you've defined your evaluators, you can use them to evaluate your model:

from langsmith.evaluation import evaluate

evaluate(
    <LLM or chain or agent>,
    data="<dataset_name>",
    evaluators=[BlocklistEvaluator(blocklist=["bad", "words"]), PerplexityEvaluator()],
)

Summary Evaluators

Some metrics can only be defined at the level of the entire experiment, as opposed to on the individual runs within it. For example, you may want to compute the overall F1 score of a classifier across all runs in an experiment kicked off from a dataset. These are called summary_evaluators. Instead of taking in a single Run and Example, they take a list of each.

from typing import List

from langsmith.evaluation import evaluate
from langsmith.schemas import Example, Run


def f1_score_summary_evaluator(runs: List[Run], examples: List[Example]) -> dict:
    true_positives = 0
    false_positives = 0
    false_negatives = 0
    for run, example in zip(runs, examples):
        # Matches the output format of your dataset
        reference = example.outputs["answer"]
        # Matches the output dict of the `predict` function below
        prediction = run.outputs["prediction"]
        if reference and prediction == reference:
            true_positives += 1
        elif prediction and not reference:
            false_positives += 1
        elif not prediction and reference:
            false_negatives += 1

    if true_positives == 0:
        return {"key": "f1_score", "score": 0.0}

    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1_score = 2 * (precision * recall) / (precision + recall)
    return {"key": "f1_score", "score": f1_score}


def predict(inputs: dict) -> dict:
    return {"prediction": True}


evaluate(
    predict,  # Your classifier
    data="<dataset_name>",
    summary_evaluators=[f1_score_summary_evaluator],
)

Recap

Congratulations! You created custom evaluators you can apply to any traced run, letting you surface more relevant information about your application. Custom evaluators speed up the development process for application-specific, semantically robust evaluations, and you can extend existing components from the library so you can focus on building your product. All your evals come with:

  • Automatic tracing integrations to help you debug, compare, and improve your code
  • Easy sharing and mixing of components and results
  • Out-of-the-box support for sync and async evaluations for faster runs
