How to Use Off-the-Shelf Evaluators
LangChain's evaluation module provides evaluators you can use as-is for common evaluation scenarios.
It's easy to use these by passing them to the evaluators
argument of the evaluate()
function.
Copy the code snippets below to get started. You can also configure them for your applications using the arguments mentioned in the "Parameters" sections.
If you don't see an implementation that suits your needs, you can learn how to create your own Custom Run Evaluators in the linked guide, or contribute an string evaluator to the LangChain repository.
Most of these evaluators are useful but imperfect! We recommend against blind trust of any single automated metric and to always incorporate them as a part of a holistic testing and evaluation strategy.
Many of the LLM-based evaluators return a binary score for a given data point, so measuring differences in prompt or model performance are most reliable in aggregate over a larger dataset.
Overview
The following table enumerates the off-the-shelf evaluators available in LangSmith, along with their output keys and a simple code sample.
Evaluator name | Output Key | Simple Code Example |
---|---|---|
QA | correctness | LangChainStringEvaluator("qa") |
Contextual Q&A | contextual accuracy | LangChainStringEvaluator("context_qa") |
Chain of Thought Q&A | cot contextual accuracy | LangChainStringEvaluator("cot_qa") |
Criteria | Depends on criteria key | LangChainStringEvaluator("criteria", config={ "criteria": <criterion> }) criterion may be one of the default implemented criteria: conciseness , relevance , correctness , coherence , harmfulness , maliciousness , helpfulness , controversiality , misogyny , and criminality .Or, you may define your own criteria in a custom dict as follows: { "criterion_key": "criterion description" } |
Labeled Criteria | Depends on criteria key | LangChainStringEvaluator("labeled_criteria", config={ "criteria": <criterion> }) criterion may be one of the default implemented criteria: conciseness , relevance , correctness , coherence , harmfulness , maliciousness , helpfulness , controversiality , misogyny , and criminality .Or, you may define your own criteria in a custom dict as follows: { "criterion_key": "criterion description" } |
Score | Depends on criteria key | LangChainStringEvaluator("score_string", config={ "criteria": <criterion>, "normalize_by": 10 }) criterion may be one of the default implemented criteria: conciseness , relevance , correctness , coherence , harmfulness , maliciousness , helpfulness , controversiality , misogyny , and criminality .Or, you may define your own criteria in a custom dict as follows: { "criterion_key": "criterion description" } . Scores are out of 10, so normalize_by will cast this to a score from 0 to 1. |
Labeled Score | Depends on criteria key | LangChainStringEvaluator("labeled_score_string", config={ "criteria": <criterion>, "normalize_by": 10 }) criterion may be one of the default implemented criteria: conciseness , relevance , correctness , coherence , harmfulness , maliciousness , helpfulness , controversiality , misogyny , and criminality .Or, you may define your own criteria in a custom dict as follows: { "criterion_key": "criterion description" } . Scores are out of 10, so normalize_by will cast this to a score from 0 to 1. |
Embedding distance | embedding_cosine_distance | LangChainStringEvaluator("embedding_distance") |
String Distance | string_distance | LangChainStringEvaluator("string_distance", config={"distance": "damerau_levenshtein" }) distance defines the string difference metric to be applied, such as levenshtein or jaro_winkler . |
Exact Match | exact_match | LangChainStringEvaluator("exact_match") |
Regex Match | regex_match | LangChainStringEvaluator("regex_match") |
Json Validity | json_validity | LangChainStringEvaluator("json_validity") |
Json Equality | json_equality | LangChainStringEvaluator("json_equality") |
Json Edit Distance | json_edit_distance | LangChainStringEvaluator("json_edit_distance") |
Json Schema | json_schema | LangChainStringEvaluator("json_schema") |
Correctness: QA evaluation
QA evalutors help to measure the correctness of a response to a user query or question. If you have a dataset with reference labels or reference context docs, these are the evaluators for you!
Three QA evaluators you can load are: "qa"
, "context_qa"
, "cot_qa"
. Based on our meta-evals, we recommend using "cot_qa"
or a similar prompt for best results.
- The
"qa"
evaluator (reference) instructs an llm to directly grade a response as "correct" or "incorrect" based on the reference answer. - The
"context_qa"
evaluator (reference) instructs the LLM chain to use reference "context" (provided throught the example outputs) in determining correctness. This is useful if you have a larger corpus of grounding docs but don't have ground truth answers to a query. - The
"cot_qa"
evaluator (reference) is similar to the "context_qa" evaluator, except it instructs the LLMChain to use chain of thought "reasoning" before determining a final verdict. This tends to lead to responses that better correlate with human labels, for a slightly higher token and runtime cost.
- Python
from langsmith import Client
from langsmith.evaluation import LangChainStringEvaluator, evaluate
qa_evaluator = LangChainStringEvaluator("qa")
context_qa_evaluator = LangChainStringEvaluator("context_qa")
cot_qa_evaluator = LangChainStringEvaluator("cot_qa")
client = Client()
evaluate(
<your pipeline>,
data="<dataset_name>",
evaluators=[qa_evaluator, context_qa_evaluator, cot_qa_evaluator],
metadata={"revision_id": "the version of your pipeline you are testing"},
)
- Python
from langchain.chat_models import ChatAnthropic
from langchain_core.prompts.prompt import PromptTemplate
from langsmith.evaluation import LangChainStringEvaluator
_PROMPT_TEMPLATE = """You are an expert professor specialized in grading students' answers to questions.
You are grading the following question:
{input}
Here is the real answer:
{reference}
You are grading the following predicted answer:
{prediction}
Respond with CORRECT or INCORRECT:
Grade:
"""
PROMPT = PromptTemplate(
input_variables=["input", "reference", "prediction"], template=_PROMPT_TEMPLATE
)
eval_llm = ChatAnthropic(temperature=0.0)
qa_evaluator = LangChainStringEvaluator("qa", config={"llm": eval_llm, "prompt": PROMPT})
context_qa_evaluator = LangChainStringEvaluator("context_qa", config={"llm": eval_llm})
cot_qa_evaluator = LangChainStringEvaluator("cot_qa", config={"llm": eval_llm})
evaluate(
<your pipeline>,
data="<dataset_name>",
evaluators=[qa_evaluator, context_qa_evaluator, cot_qa_evaluator],
)
Criteria Evaluators (No Labels)
If you don't have ground truth reference labels, you can evaluate your run against a custom set of criteria using the "criteria"
or "score"
evaluators. These are helpful when there are high level semantic aspects of your model's output you'd like to monitor that aren't captured by other explicit checks or rules.
- The
"criteria"
evaluator (reference) instructs an LLM to assess if a prediction satisfies the given criteria, outputting a binary score - The
"score_string"
evaluator (reference) has the LLM score the prediction on a numeric scale (default 1-10) based on how well it satisfies the criteria
- Python
from langsmith import Client
from langsmith.evaluation import LangChainStringEvaluator, evaluate
criteria_evaluator = LangChainStringEvaluator(
"criteria",
config={
"criteria": {
"creativity": "Is this submission creative, imaginative, or novel?",
"conciseness": "Is this response concise and to the point?",
}
}
)
score_evaluator = LangChainStringEvaluator(
"score_string",
config={
"criteria": {
"accuracy": "How accurate is this prediction on a scale of 1-10?"
},
# If you want the score to be saved on a scale from 0 to 1
"normalize_by": 10,
}
)
client = Client()
evaluate(
<your pipeline>,
data="<dataset_name>",
evaluators=[
criteria_evaluator,
score_evaluator
],
metadata={"revision_id": "the version of your pipeline you are testing"},
)
Default criteria are implemented for the following aspects: conciseness, relevance, correctness, coherence, harmfulness, maliciousness, helpfulness, controversiality, misogyny, and criminality.
To specify custom criteria, write a mapping of a criterion name to its description, such as:
criterion = {"creativity": "Is this submission creative, imaginative, or novel?"}
criteria_evaluator = LangChainStringEvaluator(
"labeled_criteria",
config={"criteria": criterion}
)
Evaluation scores don't have an inherent "direction" (i.e., higher is not necessarily better). The direction of the score depends on the criteria being evaluated. For example, a score of 1 for "helpfulness" means that the prediction was deemed to be helpful by the model. However, a score of 1 for "maliciousness" means that the prediction contains malicious content, which, of course, is "bad".
Criteria Evaluators (With Labels)
If you have ground truth reference labels, you can evaluate your run against custom criteria while also providing that reference information to the LLM using the "labeled_criteria"
or "labeled_score_string"
evaluators.
- The
"labeled_criteria"
evaluator (reference) instructs an LLM to assess if a prediction satisfies the criteria, taking into account the reference label - The
"labeled_score_string"
evaluator (reference) has the LLM score the prediction on a numeric scale based on how well it satisfies the criteria compared to the reference
- Python
from langsmith import Client
from langsmith.evaluation import LangChainStringEvaluator, evaluate
labeled_criteria_evaluator = LangChainStringEvaluator(
"labeled_criteria",
config={
"criteria": {
"helpfulness": (
"Is this submission helpful to the user,"
" taking into account the correct reference answer?"
)
}
},
prepare_data=lambda run, example: {
"prediction": run.outputs["output"],
"reference": example.outputs["answer"],
"input": example.inputs["question"],
}
)
labeled_score_evaluator = LangChainStringEvaluator(
"labeled_score_string",
config={
"criteria": {
"accuracy": "How accurate is this prediction compared to the reference on a scale of 1-10?"
},
"normalize_by": 10,
},
prepare_data=lambda run, example: {
"prediction": run.outputs["output"],
"reference": example.outputs["answer"],
"input": example.inputs["question"],
}
)
client = Client()
evaluate(
<your pipeline>,
data="<dataset_name>",
evaluators=[
labeled_criteria_evaluator,
labeled_score_evaluator
],
metadata={"revision_id": "the version of your pipeline you are testing"},
)
JSON Evaluators
Evaluating extraction and function calling applications often comes down to validating that the LLM's string output can be parsed correctly and comparing it to a reference object. The JSON evaluators provide functionality to check your model's output consistency:
- The
"json_validity"
(reference) evaluator checks if a prediction is valid JSON - The
"json_equality"
(reference) evaluator checks if a JSON prediction exactly matches a JSON reference, after normalization - The
"json_edit_distance"
(reference) evaluator computes the normalized edit distance between a JSON prediction and reference - The
"json_schema"
(reference) evaluator checks if a JSON prediction satisfies a provided JSON schema
- Python
from langsmith.evaluation import LangChainStringEvaluator, evaluate
json_validity_evaluator = LangChainStringEvaluator("json_validity")
json_equality_evaluator = LangChainStringEvaluator("json_equality")
json_edit_distance_evaluator = LangChainStringEvaluator("json_edit_distance")
json_schema_evaluator = LangChainStringEvaluator(
"json_schema",
config={
"schema": {
"type": "object",
"properties": {
"name": {"type": "string"},
"age": {"type": "integer", "minimum": 0}
},
"required": ["name"]
}
}
)
evaluate(
<your pipeline>,
data="<dataset_name>",
evaluators=[
json_validity_evaluator,
json_equality_evaluator,
json_edit_distance_evaluator,
json_schema_evaluator
],
)
String or Embedding Distance
To measure the similarity between a predicted string and a reference, you can use string distance metrics:
- The
"string_distance"
(reference) evaluator computes a normalized string edit distance between the prediction and reference - The
"embedding_distance"
(reference) evaluator computes the distance between the text embeddings of the prediction and reference - The
"exact_match"
(reference) evaluator checks for an exact string match between prediction and reference
- Python
from langsmith.evaluation import LangChainStringEvaluator, evaluate
string_distance_evaluator = LangChainStringEvaluator(
"string_distance",
config={"distance": "levenshtein", "normalize_score": True}
)
embedding_distance_evaluator = LangChainStringEvaluator(
"embedding_distance",
config={
# Defaults to OpenAI, but you can customize which embedding provider to use:
# "embeddings": HuggingFaceEmbeddings(model="distilbert-base-uncased"),
# Can also choose "euclidean", "chebyshev", "hamming", and "manhattan"
"distance_metric": "cosine",
}
)
exact_match_evaluator = LangChainStringEvaluator(
"exact_match",
config={"ignore_case": True, "ignore_punctuation": True}
)
evaluate(
<your pipeline>,
data="<dataset_name>",
evaluators=[
string_distance_evaluator,
embedding_distance_evaluator,
exact_match_evaluator
],
)
Regex Match
To evaluate predictions against a reference regular expression pattern, you can use the "regex_match"
(reference) evaluator.
The pattern is provided as a string in the example outputs of the dataset. The evaluator checks if the prediction matches the pattern.
- Python
from langsmith.evaluation import LangChainStringEvaluator, evaluate
regex_evaluator = LangChainStringEvaluator(
"regex_match",
config={
# Optionally control which flags to use in the regex match
"flags": re.IGNORECASE
}
)
evaluate(
<your pipeline>,
data="<dataset_name>",
evaluators=[regex_evaluator],
)
Don't see what you're looking for?
These implementations are just a starting point. We want to work with you to build better off-the-shelf evaluation tools for everyone. We'd love feedback and contributions! Send us feedback at support@langchain.dev, check out the Evaluators in LangChain or submit PRs or issues directly to better address your needs.