
LangSmith Evaluation Deep Dive

Preface

  • Video 1: Slides reviewing why evals are important here
  • Video 2: Slides reviewing LangSmith primitives here

Summary

See here for an overview of evaluation: https://docs.smith.langchain.com/evaluation


Environment

! pip install langsmith openai ollama
import os

os.environ['LANGCHAIN_TRACING_V2'] = 'true'  # enables tracing
os.environ['LANGCHAIN_API_KEY'] = '<your-api-key>'
os.environ['LANGCHAIN_PROJECT'] = 'Test'
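
With tracing enabled, any function decorated with the SDK's traceable decorator logs a run to the project above. As an optional smoke test (a minimal sketch, assuming the API key is valid), we can trace a trivial function and check that a run appears in the 'Test' project:

from langsmith import traceable

@traceable
def echo(text: str) -> str:
    # A trivial traced function; its run should show up in the 'Test' project
    return f"echo: {text}"

echo("tracing smoke test")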

3. Dataset: Manually Curated

Question:

How can I build my own dataset?

Setup:

Let's build a dataset of question-answer pairs on this blog post about DBRX:

https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm

We'll build a manually curated dataset of input-output pairs:


import pandas as pd

# QA
inputs = [
    "How many tokens was DBRX pre-trained on?",
    "Is DBRX a MOE model and how many parameters does it have?",
    "How many GPUs was DBRX trained on and what was the connectivity between GPUs?"
]

outputs = [
    "DBRX was pre-trained on 12 trillion tokens of text and code data.",
    "Yes, DBRX is a fine-grained mixture-of-experts (MoE) architecture with 132B total parameters.",
    "DBRX was trained on 3072 NVIDIA H100s connected by 3.2Tbps Infiniband"
]

# Dataset
qa_pairs = [{"question": q, "answer": a} for q, a in zip(inputs, outputs)]
df = pd.DataFrame(qa_pairs)

# Write to csv
csv_path = "/Users/rlm/Desktop/DBRX_eval.csv"
df.to_csv(csv_path, index=False)

LangSmith SDK docs:

from langsmith import Client

client = Client()
dataset_name = "DBRX"

# Store
dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="QA pairs about DBRX model.",
)
client.create_examples(
    inputs=[{"question": q} for q in inputs],
    outputs=[{"answer": a} for a in outputs],
    dataset_id=dataset.id,
)

Update dataset

new_questions = [
    "What is the context window of DBRX Instruct?",
]

new_answers = [
    "DBRX Instruct was trained with up to a 32K token context window.",
]

# See updated version in the UI
client.create_examples(
    inputs=[{"question": q} for q in new_questions],
    outputs=[{"answer": a} for a in new_answers],
    dataset_id=dataset.id,
)

We can also create a dataset directly from a CSV in the LangSmith UI.

LangSmith UI docs:

https://docs.smith.langchain.com/evaluation/faq/manage-datasets
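
The SDK can also ingest the CSV we wrote above. Here is a minimal sketch using the client's upload_csv helper (the dataset name 'DBRX_csv' is just illustrative; check the docs above for the exact signature in your SDK version):

# Create a dataset directly from the CSV written earlier; column names become the example keys
dataset_from_csv = client.upload_csv(
    csv_file=csv_path,
    input_keys=["question"],
    output_keys=["answer"],
    name="DBRX_csv",
    description="QA pairs about DBRX, uploaded from CSV.",
)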

4. Dataset: From User Logs

Question:

How can I save user logs as a dataset for future testing?


# Create a new project where user questions are logged

import os
os.environ['LANGCHAIN_PROJECT'] = 'DBRX'
# Load blog post

import requests
from bs4 import BeautifulSoup
url = 'https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
text = [p.text for p in soup.find_all('p')]
full_text = '\n'.join(text)
# OpenAI API

import openai
from langsmith.wrappers import wrap_openai
openai_client = wrap_openai(openai.Client())

def answer_dbrx_question_oai(inputs: dict) -> dict:
    """
    Generates answers to user questions based on a provided website text using the OpenAI API.

    Parameters:
    inputs (dict): A dictionary with a single key 'question', representing the user's question as a string.

    Returns:
    dict: A dictionary with a single key 'answer', containing the generated answer as a string.
    """

    # System prompt
    system_msg = f"Answer user questions in 2-3 sentences about this context: \n\n\n {full_text}"

    # Pass in website text
    messages = [{"role": "system", "content": system_msg},
                {"role": "user", "content": inputs["question"]}]

    # Call OpenAI
    response = openai_client.chat.completions.create(messages=messages, model="gpt-3.5-turbo")

    # Response in output dict
    return {"answer": response.dict()['choices'][0]['message']['content']}
# User question example

answer_dbrx_question_oai({"question":"What are the main differences in training efficiency between MPT-7B vs DBRX?"})
# User question example

answer_dbrx_question_oai({"question":"How many tokens was DBRX pre-trained on?"})
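
Once questions like these are logged, we can pull the runs back out of the 'DBRX' project and add them to a dataset for future testing. A minimal sketch, assuming the 'DBRX' dataset from section 3 already exists (the list_runs filters shown here may vary slightly across SDK versions):

from langsmith import Client

client = Client()

# Fetch the root runs logged to the 'DBRX' project
runs = list(client.list_runs(project_name="DBRX", is_root=True))

# Turn their inputs/outputs into new examples in the existing 'DBRX' dataset
client.create_examples(
    inputs=[run.inputs for run in runs],
    outputs=[run.outputs for run in runs],
    dataset_name="DBRX",
)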

5. LLM-as-Judge: Built-in evaluator

Question:

How can I evaluate my LLM against my dataset?

Evaluation flow


Built-in evaluator

https://docs.smith.langchain.com/evaluation/faq/evaluator-implementations

cot_qa

Uses chain-of-thought "reasoning" before determining a final verdict:

from langsmith.evaluation import evaluate, LangChainStringEvaluator

# Evaluators
qa_evaluator = [LangChainStringEvaluator("cot_qa")]
dataset_name = "DBRX"

experiment_results = evaluate(
    answer_dbrx_question_oai,
    data=dataset_name,
    evaluators=qa_evaluator,
    experiment_prefix="test-dbrx-qa-oai",
    # Any experiment metadata can be specified here
    metadata={
        "variant": "stuff website context into gpt-3.5-turbo",
    },
)

What did we do?


6. Custom evaluator

Question:

How can I define my own custom evaluator?

Let's say we want to define a simple assertion that an answer is actually generated.


from langsmith.schemas import Run, Example

def is_answered(run: Run, example: Example) -> dict:

    # Get outputs
    student_answer = run.outputs.get("answer")

    # Check if the student_answer is an empty string
    if not student_answer:
        return {"key": "is_answered", "score": 0}
    else:
        return {"key": "is_answered", "score": 1}

# Evaluators
qa_evaluator = [is_answered]
dataset_name = "DBRX"

# Run
experiment_results = evaluate(
    answer_dbrx_question_oai,
    data=dataset_name,
    evaluators=qa_evaluator,
    experiment_prefix="test-dbrx-qa-custom-eval-is-answered",
    # Any experiment metadata can be specified here
    metadata={
        "variant": "stuff website context into gpt-3.5-turbo",
    },
)

7. Comparison

Question:

How does Mistral-7b running locally compare to GPT-3.5-turbo for question-answering?

Setup:

https://github.com/ollama/ollama-python

ollama pull mistral
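
Before wiring Mistral into an evaluation, it can help to confirm the local model responds. A quick sketch, assuming the Ollama server is running and the model above has been pulled:

import ollama

# One-off chat call against the local Mistral model
reply = ollama.chat(model="mistral", messages=[{"role": "user", "content": "Say hello in one sentence."}])
print(reply["message"]["content"])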

Instrument Ollama calls with LangSmith:

https://docs.smith.langchain.com/cookbook/tracing-examples/traceable#using-the-decorator

# Mistral

import ollama
from langsmith.run_helpers import traceable

@traceable(run_type="llm")
def call_ollama(messages, model: str):
    stream = ollama.chat(messages=messages, model=model, stream=True)
    response = ''
    for chunk in stream:
        print(chunk['message']['content'], end='', flush=True)
        response = response + chunk['message']['content']
    return response

def answer_dbrx_question_mistral(inputs: dict) -> dict:
    """
    Generates answers to user questions based on a provided website text using Ollama serving Mistral locally.

    Parameters:
    inputs (dict): A dictionary with a single key 'question', representing the user's question as a string.

    Returns:
    dict: A dictionary with a single key 'answer', containing the generated answer as a string.
    """

    # System prompt
    system_msg = f"Answer user questions about this context: \n\n\n {full_text}"

    # Pass in website text
    messages = [{"role": "system", "content": system_msg},
                {"role": "user", "content": f'Answer the question in 2-3 sentences {inputs["question"]}'}]

    # Call Mistral
    response = call_ollama(messages, model="mistral")

    # Response in output dict
    return {"answer": response}

result = answer_dbrx_question_mistral({"question":"What are the main differences in training efficiency between MPT-7B vs DBRX?"})

What are we doing?


# Evaluators
qa_evaluator = [LangChainStringEvaluator("cot_qa")]
dataset_name = "DBRX"

experiment_results = evaluate(
    answer_dbrx_question_mistral,
    data=dataset_name,
    evaluators=qa_evaluator,
    experiment_prefix="test-dbrx-qa-mistral",
    # Any experiment metadata can be specified here
    metadata={
        "variant": "stuff website context into mistral",
    },
)

Use comparison view to inspect results.

8. Experiment on datasets from the prompt playground (no code)

We've shown various ways to run evals using the SDK.

But sometimes I want to do more rapid testing.

For this I can use the LangSmith prompt hub directly:

https://docs.smith.langchain.com/evaluation/faq/experiments-app

Here is a problem I've worked on recently:

I want to grade documents in a RAG chain. The grader takes as input (1) a document and (2) a question, and returns (3) a JSON with a score of yes or no indicating whether the document is related to the question.

See notebooks here.


Question:

How do different LLMs perform at instruction following to produce a JSON output?

First, I build a dataset of test examples:

# Define a dataset

import pandas as pd

# Relevance check
inputs = [
    {"question": "agent memory", "doc_txt": "agent memory has two types: short and long term"},
    {"question": "hallucinations", "doc_txt": "DBRX was pretrained on 12T tokens"},
    {"question": "DBRX context window", "doc_txt": "DBRX has a 32K token context window"},
]

outputs = [
    "yes",
    "no",
    "yes"
]
from langsmith import Client

client = Client()
dataset_name = "Relevance_grade"

# Store
dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Testing relevance grading.",
)
client.create_examples(
    inputs=inputs,
    outputs=[{"answer": a} for a in outputs],
    dataset_id=dataset.id,
)

Test prompt in the Prompt Hub.

SYSTEM

You are a grader assessing relevance of a retrieved document to a user question. It does not need to be a stringent test. The goal is to filter out erroneous retrievals. If the document contains keyword(s) or semantic meaning related to the user question, grade it as relevant. Give a binary score 'yes' or 'no' to indicate whether the document is relevant to the question. Provide the binary score as a JSON with a single key 'score' and no preamble or explanation.

HUMAN

Question: {question}

Document: {doc_txt}

9. Attach evaluators to datasets (no code)

From part 8, we:

(1) Set up a dataset of test cases for document grading

(2) Ran experiments from the prompt hub

(3) Manually reviewed them

But, we can go one step further:

We can attach an LLM evaluator to our dataset.

This is automatically applied for every experiment.

Grade prompt:

You are a grader. You will be shown: 

(1) Submission: a student submission that should be a JSON string

(2) Reference: the ground truth value expected in the JSON string

The student is producing a JSON with a single key "score" to indicate whether doc_txt is relevant to the question for this input:

[Input]: {input}

Grade the student as correct if the student submission is valid JSON (or a JSON string) and contains the Reference value. If the student submission contains a preamble of text (e.g., 'sure, here is the JSON'), then score it as incorrect, because we only want the JSON returned.

[BEGIN DATA]

***

[Submission]: {output}

***

[Reference]: {reference}

***

[END DATA]

10. Instrumenting Unit Tests

Unit tests are often simple assertions that are run as part of CI.

Example:

I've done some recent work on code generation (e.g., see this example).


I use function calling to produce a solution with prefix, imports, and code blocks.

Question:

How can I instrument unit tests that check whether the imports and code blocks execute?


We'll create an example app, my_app/main.py, that generates a solution with a prefix, imports, and code blocks.

Set up example app and test -

# my_app/main.py
# tests/test_my_app.py

Run -

export PYTHONPATH="/Users/rlm/Desktop/Code/langsmith-cookbook:$PYTHONPATH"
pytest

See results logged to LangSmith.

Documentation:

https://docs.smith.langchain.com/evaluation/faq/unit-testing
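
The actual my_app/main.py and tests/test_my_app.py are omitted above, but a hypothetical test file using the SDK's unit-testing decorator might look like the sketch below (generate_solution and its imports/code fields are assumptions about the example app, not its real interface; see the docs above for the current decorator name):

# tests/test_my_app.py (illustrative sketch, not the original file)
from langsmith import unit

from my_app.main import generate_solution  # hypothetical entry point returning a solution object

@unit
def test_imports_execute():
    solution = generate_solution("Write a function that reverses a string")
    exec(solution.imports)  # the generated imports should run without error

@unit
def test_code_executes():
    solution = generate_solution("Write a function that reverses a string")
    exec(solution.imports + "\n" + solution.code)  # imports + code should run together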

11. Summary Evaluators

We previously talked about using retrieval grading as part of RAG:


In short, we use an LLM to grade whether a document is relevant to an input question.

This returns a binary yes or no.

We built an eval set where the ground truth is a binary yes or no for each example:

https://smith.langchain.com/public/ad300ffb-8bf5-450a-9c26-1b34481fb709/d

Question:

How can I create a custom metric to summarize performance on this dataset?


First, let's set up the two chains we want to compare:

### OpenAI Grader

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field

# Data model
class GradeDocuments(BaseModel):
    """Binary score for relevance check on retrieved documents."""

    score: str = Field(description="Documents are relevant to the question, 'yes' or 'no'")

# LLM with function call
llm = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0)
structured_llm_grader = llm.with_structured_output(GradeDocuments)

# Prompt
system = """You are a grader assessing relevance of a retrieved document to a user question. \n
It does not need to be a stringent test. The goal is to filter out erroneous retrievals. \n
If the document contains keyword(s) or semantic meaning related to the user question, grade it as relevant. \n
Give a binary score 'yes' or 'no' to indicate whether the document is relevant to the question."""
grade_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "Retrieved document: \n\n {document} \n\n User question: {question}"),
    ]
)

retrieval_grader_oai = grade_prompt | structured_llm_grader

def predict_oai(inputs: dict) -> dict:
    # Returns pydantic object
    grade = retrieval_grader_oai.invoke({"question": inputs["question"], "document": inputs["doc_txt"]})
    return {"grade": grade.score}

### Mistral Grader

from langchain.prompts import PromptTemplate
from langchain_community.chat_models import ChatOllama
from langchain_core.output_parsers import JsonOutputParser

# LLM
llm = ChatOllama(model="mistral", format="json", temperature=0)

prompt = PromptTemplate(
    template="""You are a grader assessing relevance of a retrieved document to a user question. \n
Here is the retrieved document: \n\n {document} \n\n
Here is the user question: {question} \n
If the document contains keywords related to the user question, grade it as relevant. \n
It does not need to be a stringent test. The goal is to filter out erroneous retrievals. \n
Give a binary score 'yes' or 'no' to indicate whether the document is relevant to the question. \n
Provide the binary score as a JSON with a single key 'score' and no preamble or explanation.""",
    input_variables=["question", "document"],
)

retrieval_grader_mistral = prompt | llm | JsonOutputParser()

def predict_mistral(inputs: dict) -> dict:
    # Returns JSON
    grade = retrieval_grader_mistral.invoke({"question": inputs["question"], "document": inputs["doc_txt"]})
    return {"grade": grade['score']}

Documentation:

https://docs.smith.langchain.com/evaluation/faq/custom-evaluators#summary-evaluators

We can define a custom summary metric over the dataset.

Precision and recall are common metrics for evaluating a binary classification:

  • Precision: true positives (TP) / all predicted positives (TP + false positives (FP)).
  • Recall: TP / all samples that should have been identified as positive (TP + false negatives (FN)).

F1 considers both the precision and the recall of the test to compute the score:

  • The F1 score is the harmonic mean of precision and recall, and it reaches its best value at 1. For example, 2 TP, 1 FP, and 1 FN give precision = 2/3, recall = 2/3, and F1 = 2/3.
from typing import List
from langsmith.schemas import Example, Run
from langsmith.evaluation import evaluate

def f1_score_summary_evaluator(runs: List[Run], examples: List[Example]) -> dict:
    """
    Evaluates the F1 score for a list of runs against a set of examples.

    The function iterates through paired runs and examples, comparing the output
    of each run (`run.outputs["grade"]`) with the expected output in the example
    (`example.outputs["answer"]`). It calculates the true positives, false positives,
    and false negatives based on these comparisons to compute the F1 score of the predictions.

    Parameters:
    - runs (List[Run]): A list of run objects, where each run contains an output that is a prediction.
    - examples (List[Example]): A list of example objects, where each example contains an output that is the expected answer.

    Returns:
    - dict: A dictionary with a single key-value pair where the key is "f1_score" and the value is the computed score.
    """

    # Default values
    true_positives = 0
    false_positives = 0
    false_negatives = 0

    # Iterate through samples, treating 'yes' as the positive class
    for run, example in zip(runs, examples):
        reference = example.outputs["answer"] == "yes"
        prediction = run.outputs["grade"] == "yes"
        if prediction and reference:
            true_positives += 1
        elif prediction and not reference:
            false_positives += 1
        elif not prediction and reference:
            false_negatives += 1

    if true_positives == 0:
        return {"key": "f1_score", "score": 0.0}

    # Compute F1 score
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1_score = 2 * (precision * recall) / (precision + recall)
    return {"key": "f1_score", "score": f1_score}

evaluate(
    predict_mistral,
    data="Relevance_grade",
    summary_evaluators=[f1_score_summary_evaluator],
    experiment_prefix="test-score-mistral",
    # Any experiment metadata can be specified here
    metadata={
        "model": "mistral",
    },
)
View the evaluation results for experiment: 'test-score-mistral-c288bf9f' at:
https://smith.langchain.com/o/1fa8b1f4-fcb9-4072-9aa9-983e35ad61b8/datasets/1627c2dc-8207-44a1-86a3-bc7e23bd9777/compare?selectedSessions=f6a22a99-5e13-47d5-916a-cdd6b1a0ea38

<ExperimentResults test-score-mistral-c288bf9f>

evaluate(
    predict_oai,
    data="Relevance_grade",
    summary_evaluators=[f1_score_summary_evaluator],
    experiment_prefix="test-score-oai",
    # Any experiment metadata can be specified here
    metadata={
        "model": "oai",
    },
)
View the evaluation results for experiment: 'test-score-oai-dccbb67b' at:
https://smith.langchain.com/o/1fa8b1f4-fcb9-4072-9aa9-983e35ad61b8/datasets/1627c2dc-8207-44a1-86a3-bc7e23bd9777/compare?selectedSessions=1614d57e-3152-41a9-98e2-ba23c12a0439

<ExperimentResults test-score-oai-dccbb67b>

12-14. Evaluating RAG

See our RAG guide.
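
The predict_rag_answer function used below comes from that guide. To make its expected signature explicit, here is a minimal placeholder (the rag_chain it calls is hypothetical here and stands in for the chain built in the RAG guide):

def predict_rag_answer(example: dict) -> dict:
    # evaluate() passes dataset inputs in and expects a dict of outputs back
    response = rag_chain.invoke(example["question"])  # hypothetical RAG chain from the guide
    return {"answer": response}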

# Evaluators
qa_evaluator = [LangChainStringEvaluator("cot_qa")]
dataset_name = "RAG_test_LCEL"

experiment_results = evaluate(
    predict_rag_answer,
    data=dataset_name,
    evaluators=qa_evaluator,
    experiment_prefix="rag-qa-oai",
    # Any experiment metadata can be specified here
    metadata={
        "variant": "LCEL context, gpt-3.5-turbo",
    },
)
View the evaluation results for experiment: 'rag-qa-oai-166e3567' at:
https://smith.langchain.com/o/1fa8b1f4-fcb9-4072-9aa9-983e35ad61b8/datasets/368734fb-7c14-4e1f-b91a-50d52cb58a07/compare?selectedSessions=bc8c1195-1722-4ec9-9b18-45b76c648717

15. Regression testing

Previously, we talked about various types of RAG evaluations.

Question:

How can I assess whether a new LLM (e.g., phi3) can be used in my RAG chain?

For this, regression testing is highly useful.

It lets us easily pinpoint changes in performance in our eval set across model versions.

First, define an eval set:

import os
os.environ['LANGCHAIN_PROJECT'] = 'RAG_bot_langsmith_online_eval'
from langsmith import Client 

# QA
inputs = [
    "My LCEL map contains the key 'question'. What is the difference between using itemgetter('question'), lambda x: x['question'], and x.get('question')?",
    "How can I make the output of my LCEL chain a string?",
    "How can I run two LCEL chains in parallel and write their output to a map?"
]

outputs = [
    "Itemgetter can be used as shorthand to extract specific keys from the map. In the context of a map operation, the lambda function is applied to each element in the input map and the function returns the value associated with the key 'question'. (get) is safer for accessing values in a dictionary because it handles the case where the key might not exist.",
    "Use StrOutputParser. from langchain_openai import ChatOpenAI; from langchain_core.prompts import ChatPromptTemplate; from langchain_core.output_parsers import StrOutputParser; prompt = ChatPromptTemplate.from_template('Tell me a short joke about {topic}'); model = ChatOpenAI(model='gpt-3.5-turbo') #gpt-4 or other LLMs can be used here; output_parser = StrOutputParser(); chain = prompt | model | output_parser",
    "We can use RunnableParallel. For example: from langchain_core.prompts import ChatPromptTemplate; from langchain_core.runnables import RunnableParallel; from langchain_openai import ChatOpenAI; model = ChatOpenAI(); joke_chain = ChatPromptTemplate.from_template('tell me a joke about {topic}') | model; poem_chain = (ChatPromptTemplate.from_template('write a 2-line poem about {topic}') | model); map_chain = RunnableParallel(joke=joke_chain, poem=poem_chain); map_chain.invoke({'topic': 'bear'})"
]

qa_pairs = [{"question": q, "answer": a} for q, a in zip(inputs, outputs)]

# Create dataset
client = Client()
dataset_name = "RAG_QA_LCEL"
dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="QA pairs about LCEL.",
)
client.create_examples(
    inputs=[{"question": q} for q in inputs],
    outputs=[{"answer": a} for a in outputs],
    dataset_id=dataset.id,
)

RAG chain:

### INDEX

from bs4 import BeautifulSoup as Soup
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader

# Load
url = "https://python.langchain.com/docs/expression_language/"
loader = RecursiveUrlLoader(url=url, max_depth=20, extractor=lambda x: Soup(x, "html.parser").text)
docs = loader.load()

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2500, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

# Embed
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

# Index
retriever = vectorstore.as_retriever()
### RAG 

import openai
from langsmith import traceable
from langsmith.wrappers import wrap_openai
from langchain.prompts import PromptTemplate
from langchain_community.chat_models import ChatOllama
from langchain_core.output_parsers import StrOutputParser

class RagBot:
    def __init__(self, retriever, provider: str = "openai", model: str = "gpt-4-0125-preview"):
        self._retriever = retriever
        self._provider = provider
        self._model = model
        if provider == "openai":
            self._client = wrap_openai(openai.Client())
        elif provider == "ollama":
            self._client = ChatOllama(model=model, temperature=0)

    def retrieve_docs(self, question):
        return self._retriever.invoke(question)

    @traceable()
    def get_answer(self, question: str):
        similar = self.retrieve_docs(question)
        if self._provider == "openai":
            # OpenAI RAG
            response = self._client.chat.completions.create(
                model=self._model,
                messages=[
                    {
                        "role": "system",
                        "content": "You are a helpful AI code assistant with expertise in LCEL.\n"
                        " Use the following docs to produce a concise code solution to the user question.\n"
                        " Use three sentences maximum and keep the answer concise. \n"
                        f"## Docs\n\n{similar}",
                    },
                    {"role": "user", "content": question},
                ],
            )
            response_str = response.choices[0].message.content

        elif self._provider == "ollama":
            # Ollama RAG
            prompt = PromptTemplate(
                template="""You are a helpful AI code assistant with expertise in LCEL.
Use the following docs to produce a concise code solution to the user question.
If you don't know the answer, just say that you don't know.
Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer: """,
                input_variables=["question", "context"],
            )
            rag_chain = prompt | self._client | StrOutputParser()
            response_str = rag_chain.invoke({"context": similar, "question": question})

        return {
            "answer": response_str,
            "contexts": [str(doc) for doc in similar],
        }

def predict_rag_answer_oai(example: dict):
    """Use this for answer evaluation"""
    rag_bot = RagBot(retriever, provider="openai", model="gpt-4-0125-preview")
    response = rag_bot.get_answer(example["question"])
    return {"answer": response["answer"]}

def predict_rag_answer_llama3(example: dict):
    """Use this for answer evaluation"""
    rag_bot = RagBot(retriever, provider="ollama", model="llama3")
    response = rag_bot.get_answer(example["question"])
    return {"answer": response["answer"]}

def predict_rag_answer_phi3(example: dict):
    """Use this for answer evaluation"""
    rag_bot = RagBot(retriever, provider="ollama", model="phi3")
    response = rag_bot.get_answer(example["question"])
    return {"answer": response["answer"]}

Define evaluator:

from langsmith.evaluation import LangChainStringEvaluator, evaluate

answer_evaluator = LangChainStringEvaluator(
    "labeled_score_string",
    config={
        "criteria": {
            "accuracy": """Is the Assistant's Answer grounded in and similar to the Ground Truth answer? A score of [[1]] means that the
Assistant answer is not at all grounded in and similar to the Ground Truth answer. A score of [[5]] means that the Assistant
answer contains some information that is grounded in and similar to the Ground Truth answer. A score of [[10]] means that the
Assistant answer is fully grounded in and similar to the Ground Truth answer."""
        },
        # If you want the score to be saved on a scale from 0 to 1
        "normalize_by": 10,
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    },
)
from langsmith.evaluation import LangChainStringEvaluator, evaluate

dataset_name = "RAG_QA_LCEL"
experiment_results = evaluate(
    predict_rag_answer_oai,
    data=dataset_name,
    evaluators=[answer_evaluator],
    experiment_prefix="rag-qa-gpt4-0125",
    metadata={"variant": "LCEL context, gpt-4-0125-preview"},
)
View the evaluation results for experiment: 'rag-qa-gpt4-0125-6b799ce4' at:
https://smith.langchain.com/o/1fa8b1f4-fcb9-4072-9aa9-983e35ad61b8/datasets/185fcfba-a9cb-4868-9904-86644881d363/compare?selectedSessions=98981d76-b10f-4856-b6be-ed86f035b34c

experiment_results = evaluate(
    predict_rag_answer_llama3,
    data=dataset_name,
    evaluators=[answer_evaluator],
    experiment_prefix="rag-qa-llama3",
    metadata={"variant": "LCEL context, llama3"},
)
View the evaluation results for experiment: 'rag-qa-llama3-ad040c84' at:
https://smith.langchain.com/o/1fa8b1f4-fcb9-4072-9aa9-983e35ad61b8/datasets/185fcfba-a9cb-4868-9904-86644881d363/compare?selectedSessions=c78dc2a7-b99c-48bf-9613-70fe53dae303

experiment_results = evaluate(
    predict_rag_answer_phi3,
    data=dataset_name,
    evaluators=[answer_evaluator],
    experiment_prefix="rag-qa-phi3",
    metadata={"variant": "LCEL context, phi3"},
)
View the evaluation results for experiment: 'rag-qa-phi3-d8ebf147' at:
https://smith.langchain.com/o/1fa8b1f4-fcb9-4072-9aa9-983e35ad61b8/datasets/185fcfba-a9cb-4868-9904-86644881d363/compare?selectedSessions=a71ffea1-b2b2-4e3b-963f-5f31de646083

17. Online Evaluators

Sometimes we want to evaluate generations as they are logged to a project.

# Test our RAG bot
rag_bot = RagBot(retriever,provider="openai",model="gpt-4-0125-preview")
response = rag_bot.get_answer("How to define an RAG chain in LCEL?")
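
An online evaluator is configured on the project in the LangSmith UI and then scores new runs as they arrive. To give it something to score, we simply keep logging generations to the project, for example (the extra questions below are purely illustrative):

# Log a few more generations so the online evaluator has fresh runs to score
more_questions = [
    "How do I compose two runnables in LCEL?",
    "How can I stream output from an LCEL chain?",
]
for q in more_questions:
    rag_bot.get_answer(q)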








