Skip to main content

How to create custom evaluators

In this guide, you will create a custom evaluator to grade your agent. You can choose to use LangChain components or write your own custom evaluator from scratch.

Run Evaluators

Run evaluators return a score and feedback (metric) key for a given run. You can define them however you see fit. Common types include:

  1. Heuristics: Checking for regex matches, presence/absence of certain words or code, etc.
  2. AI-assisted: Instruct an LLM to grade the output of a run based on the prediction reference answer.

We will demonstrate some simple ones below.

Example 1: Emptiness Evaluator

from langsmith.evaluation import EvaluationResult, run_evaluator
from langsmith.schemas import Example, Run


@run_evaluator
def is_empty(run: Run, example: Example | None = None):
model_outputs = run.outputs["output"]
score = not model_outputs.strip()
return EvaluationResult(key="is_empty", score=score)

Example 2: Blocklist Evaluator

You may want to parametrize your evaluator as a class. For this, you can use the RunEvaluator class, which is functionally equivalent to the decorator above.

from langsmith.evaluation import EvaluationResult, RunEvaluator
from langsmith.schemas import Example, Run

class BlocklistEvaluator(RunEvaluator):
def __init__(self, blocklist: list[str]):
self.blocklist = blocklist

def evaluate_run(
self, run: Run, example: Example | None = None
) -> EvaluationResult:
model_outputs = run.outputs["output"]
score = not any([word in model_outputs for word in self.blocklist])
return EvaluationResult(key="blocklist", score=score)

Example 3: Perplexity Evaluator

You may want to use statistical measures such as perplexity to grade your run output. Below is an example in which we use the evaluate package by HuggingFace, which contains numerous commonly used metrics for tasks such as text generation, machine translation, and more. Start by installing the evaluate package by running pip install evaluate.

from typing import Optional

from evaluate import load
from langsmith.evaluation import EvaluationResult, RunEvaluator
from langsmith.schemas import Example, Run


class PerplexityEvaluator(RunEvaluator):
def __init__(self, prediction_key: Optional[str] = None, model_id: str = "gpt-2"):
self.prediction_key = prediction_key
self.model_id = model_id
self.metric_fn = load("perplexity", module_type="metric")

def evaluate_run(
self, run: Run, example: Optional[Example] = None
) -> EvaluationResult:
if run.outputs is None:
raise ValueError("Run outputs cannot be None")
prediction = run.outputs[self.prediction_key]
results = self.metric_fn.compute(
predictions=[prediction], model_id=self.model_id
)
ppl = results["perplexities"][0]
return EvaluationResult(key="Perplexity", score=ppl)

Let's break down what the PerplexityEvaluator is doing:

  • Initialize: In the constructor, we're setting up a few properties that will be needed later on.

    • prediction_key: The key to find the model's prediction in the outputs of a run.
    • model_id: The ID of the language model you want to use to compute the metric. In our example, we are using 'gpt-2'.
    • metric_fn: The evaluation metric function, which is loaded from the HuggingFace evaluate package.
  • Evaluate: This method takes a run (and optionally an example) and returns an EvaluationResult.

    • If the run outputs are None, the evaluator raises an error.
    • Otherwise, the outputs are passed to the metric_fn to compute the perplexity. The perplexity score is then returned as part of an EvaluationResult.

Once you've define your evaluators, you can use them to evaluate your model.

from langsmith import Client
from langchain.smith import RunEvalConfig, run_on_dataset

ds = client.create_dataset("My Dataset")
client.create_examples(,
inputs=[
{"input": "Hello"},
{"input": "How are you?"},
],
outputs=[
{"output": "I'm good, thanks!"},
{"output": "I'm not doing so well."},
],
dataset_id=ds.id,
)

evaluation_config = RunEvalConfig(
custom_evaluators = [PerplexityEvaluator(), BlocklistEvaluator(blocklist=["bad", "words"]), is_empty],
)

def my_model(inputs):
return "This is a bad model"

client.run_on_dataset(
dataset_name="My Dataset",
llm_or_chain_factory=my_model,
evaluation=evaluation_config,
)

Custom LangChain string evaluators

Any custom LangChain StringEvaluator can be directly used for evaluation.

In this section, you will create a LangChain string evaluator that grades the relevance of a model's response to the input. You can also consult the reference documentation for more details.

Step 1: Define the evaluator

We will use an LLMChain to perform the grading. That logic can be any custom code. In this case, we will use an LLM call to output a grade from 0 to 100 based on how relevant the model thinks the output is to the input.

import re
from typing import Any, Optional
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from langchain.evaluation import StringEvaluator


class RelevanceEvaluator(StringEvaluator):
"""An LLM-based relevance evaluator."""

def __init__(self):
llm = ChatOpenAI(model="gpt-4", temperature=0)

template = """On a scale from 0 to 100, how relevant is the following response to the input:
--------
INPUT: {input}
--------
OUTPUT: {prediction}
--------
Reason step by step about why the score is appropriate, then print the score at the end. At the end, repeat that score alone on a new line."""

self.eval_chain = PromptTemplate.from_template(template) | llm

@property
def requires_input(self) -> bool:
return True

@property
def requires_reference(self) -> bool:
return False

@property
def evaluation_name(self) -> str:
return "scored_relevance"

def _evaluate_strings(
self,
prediction: str,
input: Optional[str] = None,
reference: Optional[str] = None,
**kwargs: Any
) -> dict:
evaluator_result = self.eval_chain.invoke(
{"input": input, "prediction": prediction}, kwargs
)
reasoning, score = evaluator_result["text"].split("\n", maxsplit=1)
score = re.search(r"\d+", score).group(0)
if score is not None:
score = float(score.strip()) / 100.0
return {"score": score, "reasoning": reasoning.strip()}

Let's break down what the RelevanceEvaluator is doing:

  • Initialize: In the constructor, we're instructing the LLMChain on how to grade the output.
  • requires_input: This makes sure the evaluation helper extracts the input string from your chain or LLM's input and handles additional validation.
  • requires_reference: This validator doesn't require a reference string to be passed in, meaning that your dataset doesn't even require outputs to be present.
  • evaluation_name: This is the key assigned to the feedback generated from your evaluator. You can look for runs with feedback by "scored_relevance" in the LangSmith app.
  • _evaluate_strings: This is the function that actually evaluates the input and output strings. In this case, we're using the LLMChain to generate a score and reasoning for the score.

Step 2: Evaluate

Start evaluating your chain, agent, or LLM using the run_on_dataset function.

from langsmith import Client
from langchain.smith import RunEvalConfig, run_on_dataset

evaluation_config = RunEvalConfig(
custom_evaluators = [RelevanceEvaluator()],
)
run_on_dataset(
dataset_name="<my_dataset_name>",
llm_or_chain_factory=<llm or function constructing chain>,
client=client,
evaluation=evaluation_config,
project_name="<the name to assign to this test project>",
)

Recap

Congratulations! You created a custom evaluation chain you can apply to any traced run so you can surface more relevant information in your application.

LangChain's evaluation chains speed up the development process for application-specific, semantically robust evaluations.

You can also extend existing components from the library so you can focus on building your product. All your evals come with:

  • Automatic tracing integrations to help you debug, compare, and improve your code
  • Easy sharing and mixing of components and results
  • Out-of-the-box support for sync and async evaluations for faster runs