
Bootstrap Few-shot Prompting with LangSmith

Prompt engineering is a pain. With tools like LangSmith, you can let examples optimize the prompt for you. Instead of guessing which examples will be most impactful, you can use tried-and-true evaluation practices to curate and compile the right examples for your pipeline. The main steps are:

  1. Create a dataset
  2. Pick a metric to improve
  3. Create an initial system
  4. Decide the update logic (few-shot examples vs. instruction teaching vs. other methods, how to format the examples, etc.)
  5. Train!

Below is an example of bootstrapping a gpt-3.5-turbo model on an entailment task using few-shot examples. This example is inspired by Christopher Potts' example on the SCONE dataset.

The task is natural language inference, where the LLM is required to predict whether a statement (the hypothesis) can be logically concluded from a premise / grounding statement.

%pip install -U langsmith langchain langchain_openai pandas
import os

# Update with your API URL if using a hosted instance of LangSmith.
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "YOUR API KEY"
os.environ["OPENAI_API_KEY"] = "YOUR API KEY"
# Cache LLM calls in a local SQLite database so repeated runs are faster and cheaper
from langchain.cache import SQLiteCache
from langchain_core.globals import set_llm_cache

set_llm_cache(SQLiteCache(database_path=".langchain.db"))
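If you'd rather not write a cache file to disk, the in-memory cache works the same way; it just won't persist across sessions:

# Alternative: cache LLM calls in memory only (cleared when the process exits)
from langchain.cache import InMemoryCache

set_llm_cache(InMemoryCache())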
from langsmith import Client

client = Client()

public_datasets = [
    "https://smith.langchain.com/public/1d065de2-56c1-496e-bc66-bdce308e6537/d",  # train
    "https://smith.langchain.com/public/3205fa05-bd78-4eaf-924f-96df0f577b1f/d",  # train2
    "https://smith.langchain.com/public/fdf16166-1edd-418f-b777-3af82034931d/d",  # dev
    "https://smith.langchain.com/public/aee61506-3c60-4ca8-95c4-0314c9719ca8/d",  # dev2
    "https://smith.langchain.com/public/8d40d210-f8e6-4def-a206-78c5080c5d53/d",  # test
]
for ds in public_datasets:
    client.clone_public_dataset(ds)
train_name = "scone-train2"
dev_name = "scone-dev2"
test_name = "scone-test-one-scoped"
full_test_name = "scone-test"

example = next(client.list_examples(dataset_name=train_name))
print("inputs", example.inputs)
print("outputs", example.outputs)
inputs {'context': 'A man who does not walk confidently dropping produce.', 'question': 'Can we logically conclude for sure that a man who does not walk confidently dropping kale?'}
outputs {'answer': 'No', 'category': 'one_not_scoped'}

As the example above shows, these can be tricky!

Evaluator

Since we have ground-truth classification labels, we can use an exact-match criterion as our evaluator.

from langsmith.evaluation import run_evaluator


@run_evaluator
def exact_match(run, example):
    # Evaluate the exact-match correctness of the NLI result
    try:
        predicted = run.outputs["is_entailed"]
        expected = example.outputs["answer"]
        score = expected.lower() == predicted.lower()
    except Exception:
        try:
            expected = example.outputs["answer"]
            expected_bool = {"no": False, "yes": True}.get(expected.strip().lower())
            score = run.outputs["output"].is_entailed == expected_bool
        except Exception:
            score = 0
    return {
        "key": "exact_match",
        "score": int(score),
    }
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

# And we will create a placeholder in the template to add few-shot examples
prompt = PromptTemplate.from_template(
"""You are given some context (a premise) and a question (a hypothesis). You must indicate with Yes/No answer whether we can logically conclude the hypothesis from the premise.

---

Follow the following format.

Context: ${{context}}

Question: ${{question}}

Reasoning: Let's think step by step in order to ${{produce the answer}}. We ...

Answer: Yes or No

---{examples}

Context: {context}

Question: {question}

Reasoning: Let's think step by step in order to"""
).partial(examples="")


def parse(pred: str):
    # Split the completion into the reasoning and the final Yes/No answer
    fnd = "\nAnswer:"
    idx = pred.find(fnd)
    answer = pred[idx + len(fnd) :].strip()
    return {"is_entailed": answer, "reasoning": pred[:idx].strip()}


chain = prompt | ChatOpenAI(model="gpt-3.5-turbo") | StrOutputParser() | parse
prediction = chain.invoke(example.inputs)
prediction
{'is_entailed': 'No',
'reasoning': 'produce the answer. We know that the man does not walk confidently and drops produce. However, dropping produce does not necessarily mean he drops kale specifically. He could be dropping any type of produce.'}

Initial Evaluation

from langchain.smith import RunEvalConfig

eval_config = RunEvalConfig(
    custom_evaluators=[exact_match],
)
res = client.run_on_dataset(
    dataset_name="scone-test2",  # dev_name,
    llm_or_chain_factory=chain,
    evaluation=eval_config,
    project_metadata={"optimizer": None},
)
View the evaluation results for project 'passionate-copy-48' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/f1b328a2-b4e8-473c-808f-e042d38f6ebd/compare?selectedSessions=bb3d33aa-53a1-4d63-8b79-3758df4b1fb7

View all tests for Dataset scone-test2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/f1b328a2-b4e8-473c-808f-e042d38f6ebd
[------------------------------------------------->] 200/200

We got about 55% on it. Definitely room for improvement.

✨ Optimize ✨

This just means "use data to update the system". At present, LangChain runnables don't natively support a "backwards" method (à la PyTorch), but you can easily define updates/mutations for the key components you'd want to optimize, such as prompts or LLMs.

For instance, component by component, you could apply:

  • Few-shot prompting: add an additional string input or a MessagesPlaceholder to the prompt template (see the sketch after this list)
  • Instruction updates: edit the prompt template directly (most likely the system prompt)
  • LLM: fine-tune the model weights (the closest analog to a true "backwards pass")
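
For illustration, here is a minimal sketch of the first option using a chat-style prompt with an optional MessagesPlaceholder for the examples. This is not the template used below (which splices the examples in as a formatted string); it's just one way the few-shot hook could look:

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

# Hypothetical chat-prompt variant: few-shot examples are injected as messages.
few_shot_chat_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "Answer Yes or No: can the hypothesis be concluded from the premise?"),
        # Left empty unless an "examples" list of messages is provided at invocation time
        MessagesPlaceholder("examples", optional=True),
        ("human", "Context: {context}\n\nQuestion: {question}"),
    ]
)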

We will focus on few-shot prompting to limit the search space. We will then run a simple search loop that bootstraps candidate few-shot examples, compares their performance, and keeps the set that provides the most "lift" on the chosen metric.

We'll first create a constructor for our chain that accepts the few-shot examples, letting us re-create the chain with each updated state.

# We will define how we want our few-shot examples to be formatted
import random
from typing import List, Optional

from langchain_core.runnables import RunnableLambda


def format_example(example: dict):
    inputs = example["input"]
    outputs = example["output"]
    return f"""

Context: {inputs['context']}

Question: {inputs['question']}

Reasoning: {outputs['reasoning']}

Answer: {outputs['is_entailed']}

"""


def format_few_shot(input_: dict, examples: Optional[List[dict]] = None):
    if examples:
        # TODO: make this configurable / bound to the prompt template
        input_["examples"] = (
            "--".join(format_example(e) for e in examples) + "--"
        )
    return input_


def create_chain(examples: Optional[List] = None, llm=None):
    llm = llm or ChatOpenAI(model="gpt-3.5-turbo")
    chain = (
        RunnableLambda(format_few_shot).bind(examples=examples)
        | prompt
        | llm
        | StrOutputParser()
        | parse
    ).with_config(tags=["to_train"])
    return chain
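
As a quick sanity check, we can hand the constructor a single bootstrapped example (here just the training example printed earlier, re-shaped into the dict format that `step` produces below; the reasoning string is a paraphrase of the earlier model output) and invoke the resulting chain:

# One illustrative few-shot example in the {"input": ..., "output": ..., "id": ...} shape
demo_examples = [
    {
        "input": example.inputs,
        "output": {
            "reasoning": "produce the answer. We know dropping produce does not necessarily mean dropping kale specifically.",
            "is_entailed": "No",
        },
        "id": example.id,
    }
]
demo_chain = create_chain(examples=demo_examples)
demo_chain.invoke(example.inputs)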

Training

Next, we'll define the training utilities.

from langchain_core.tracers.context import collect_runs


def step(
    construct_chain,
    train_examples,
    eval_config,
    examples=None,
    bootstrap_k: int = 8,
):
    collected = examples.copy() if examples else []
    train_examples = train_examples.copy()
    random.shuffle(train_examples)
    # TODO: Batching to speed it up
    while train_examples:
        if len(collected) >= bootstrap_k:
            break
        train_batch = [
            train_examples.pop() for _ in range(bootstrap_k - len(collected))
        ]
        # NOTE: `example` here is whatever was bound most recently in the loop below
        # (or the module-level `example` on the very first pass).
        chain = construct_chain([e for e in collected if e["id"] != example.id])
        with collect_runs() as cb:
            chain.batch([e.inputs for e in train_batch])
        evaluator = eval_config.custom_evaluators[0]
        for run, example in zip(cb.traced_runs, train_batch):
            metric = evaluator.evaluate_run(run, example)
            score = metric.score
            # Keep the example (with the model's reasoning) only if the prediction was correct
            if score:
                collected.append(
                    {
                        "input": example.inputs,
                        "output": run.outputs,
                        "id": example.id,
                    }
                )
    return collected


def eval(eval_dataset, chain, eval_config, step_n) -> float:
    """Compute the metrics on the validation dataset."""
    dev_results = client.run_on_dataset(
        dataset_name=eval_dataset,
        llm_or_chain_factory=chain,
        evaluation=eval_config,
        verbose=True,
        concurrency_level=8,
        project_metadata={
            "step": step_n,
        },
    )
    df = dev_results.to_dataframe()
    feedback_key = [c for c in df.columns if c.startswith("feedback.")][0]
    # Assume a single metric for now
    return df[feedback_key].mean()


def train(
    chain_constructor,
    train_dataset,
    eval_dataset,
    eval_config,
    steps: int = 5,
    k: int = 8,
    bootstrap_k: int = 8,
):
    """Run the full training loop."""
    best_score = eval(eval_dataset, chain_constructor(), eval_config, 0)
    best_step = 0
    scores = [(best_score, [])]
    train_examples = list(client.list_examples(dataset_name=train_dataset))
    for step_number in range(steps):
        collected = step(
            chain_constructor, train_examples, eval_config, bootstrap_k=bootstrap_k
        )
        if len(collected) < k:
            # TODO: probably want some diversity of labels here.
            # NOTE: these are raw dataset Examples rather than the dicts `step` produces,
            # so in practice bootstrap_k should be >= k.
            to_sample = min(k - len(collected), len(train_examples))
            collected += random.sample(train_examples, to_sample)
        selected_examples = collected
        updated_chain = chain_constructor(examples=selected_examples)
        updated_score = eval(eval_dataset, updated_chain, eval_config, step_number + 1)
        scores.append((updated_score, selected_examples))

        if updated_score > best_score:
            print(
                f"New best score {updated_score} > {best_score}. Updating selected examples."
            )
            best_score = updated_score
            best_step = step_number + 1
        else:
            print("Underperformed. Continuing")
    print("Best overall score: ", best_score)
    print("Best step: ", best_step)
    return sorted(scores, key=lambda x: x[0], reverse=True)

Train

Now we can finally run the training loop!

import functools

# We will train with gpt-4-turbo
llm = ChatOpenAI(model="gpt-4-turbo-preview")
all_scores = train(
    functools.partial(create_chain, llm=llm),
    train_name,
    dev_name,
    eval_config,
    steps=10,
)
View the evaluation results for project 'bold-show-44' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=0478dc12-5f1a-4d1b-84d6-95699f05bf77

View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[------------------------------------------------->] 50/50

Experiment Results: exact_match mean = 0.86, mean execution time ≈ 0.02 s (n = 50)
View the evaluation results for project 'giving-record-97' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=c181b376-6214-4130-8d6e-87ee7c0cfd5f

View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[------------------------------------------------->] 50/50

Experiment Results: exact_match mean = 0.86, mean execution time ≈ 9.1 s (n = 50)
Underperformed. Continuing
View the evaluation results for project 'proper-man-52' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=13f9f137-b12b-41c8-bc51-fc65aed67594

View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[-----------------------> ] 24/50

Error Type: BadRequestError, Message: Error code: 400 - {'error': {'message': 'You requested a model that is not compatible with this engine. Please contact us through our help center at help.openai.com for further questions.', 'type': 'invalid_request_error', 'param': 'model', 'code': None}}


[------------------------------------------------->] 50/50

Experiment Results: exact_match mean = 0.836735, mean execution time ≈ 10.0 s (n = 49; 1 run errored)
Underperformed. Continuing
View the evaluation results for project 'proper-quiet-36' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=c6f18469-7df3-41d5-bd70-10ee4a076182

View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[----------------------------> ] 29/50

Error Type: BadRequestError, Message: Error code: 400 - {'error': {'message': 'You requested a model that is not compatible with this engine. Please contact us through our help center at help.openai.com for further questions.', 'type': 'invalid_request_error', 'param': 'model', 'code': None}}


[------------------------------------------------->] 50/50

Experiment Results: exact_match mean = 0.897959, mean execution time ≈ 7.2 s (n = 49; 1 run errored)
New best score 0.8979591836734694 > 0.86. Updating selected examples.
View the evaluation results for project 'advanced-competition-88' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=31ece295-31c4-4c3c-b9f0-a1df3dd09adb

View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[------------------------------------------------->] 50/50

Experiment Results: exact_match mean = 0.86, mean execution time ≈ 8.5 s (n = 50)
Underperformed. Continuing
View the evaluation results for project 'drab-print-47' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=70686baf-1859-4bcf-91b3-82c41843cd86

View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[------------------------------------------------->] 50/50

Experiment Results: exact_match mean = 0.90, mean execution time ≈ 10.4 s (n = 50)
New best score 0.9 > 0.8979591836734694. Updating selected examples.
View the evaluation results for project 'impressionable-writer-19' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=1f31eff6-8ab8-4b16-baa5-6f3669f4dead

View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[------------------------------------------------->] 50/50

Experiment Results: exact_match mean = 0.88, mean execution time ≈ 7.2 s (n = 50)
Underperformed. Continuing
View the evaluation results for project 'drab-map-24' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=aa3fb10d-f9a7-47ac-a90d-c385085339fc

View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[------------------------------------------------->] 50/50

Experiment Results: exact_match mean = 0.88, mean execution time ≈ 7.4 s (n = 50)
Underperformed. Continuing
View the evaluation results for project 'best-step-66' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=1d7c26de-3ae1-470e-8c51-9b2873a442c9

View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[------------------------------------------------->] 50/50

Experiment Results: exact_match mean = 0.92, mean execution time ≈ 8.3 s (n = 50)
New best score 0.92 > 0.9. Updating selected examples.
View the evaluation results for project 'brief-color-26' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=4b090fa5-87cf-4bab-8f90-d86d91102240

View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[------------------------------------------------->] 50/50

Experiment Results: exact_match mean = 0.86, mean execution time ≈ 9.2 s (n = 50)
Underperformed. Continuing
View the evaluation results for project 'worthwhile-rabbit-93' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=c8676b03-e009-4a3b-aa50-1f16a4476dbf

View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[------------------------------------------------->] 50/50

Experiment Results: exact_match mean = 0.88, mean execution time ≈ 8.7 s (n = 50)
Underperformed. Continuing
Best overall score: 0.92
Best step: 8

Compare on held-out set

It's easy to overfit to a single benchmark if you explicitly select your pipeline based on metrics from that benchmark.

Let's compare models on an unseen test set to see whether the selected examples are reliably better.

best_score, best_examples = all_scores[0]
original_model = create_chain()
# This time we will use gpt-3.5-turbo, but with the few-shot examples + reasoning trajectories
# from gpt-4 to help induce better performance
best_performing_model = create_chain(best_examples)
for model_name, model in [
    ("optimized", best_performing_model),
    # ("original", original_model),
]:
    client.run_on_dataset(
        dataset_name=test_name,
        llm_or_chain_factory=model,
        evaluation=eval_config,
        verbose=True,
        project_metadata={
            "model": model_name,
        },
    )
View the evaluation results for project 'shiny-ship-82' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/f1b328a2-b4e8-473c-808f-e042d38f6ebd/compare?selectedSessions=368a8216-6462-4d19-8261-9709fe301b19

View all tests for Dataset scone-test2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/f1b328a2-b4e8-473c-808f-e042d38f6ebd
[------------------------------------------------->] 200/200

Experiment Results: exact_match mean = 0.87, mean execution time ≈ 1.8 s (n = 200)

Using the GPT-4-generated examples, we were able to boost performance from ~0.54 to ~0.87: not bad!
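
If you want to reuse the winning examples outside this notebook, one simple option (a rough sketch; the filename is arbitrary) is to serialize them to disk and rebuild the optimized chain from the file later:

import json

# Persist the selected few-shot examples (default=str handles UUID example ids)
with open("scone_few_shot_examples.json", "w") as f:
    json.dump(best_examples, f, indent=2, default=str)

# Later / elsewhere: reload them and reconstruct the optimized chain
with open("scone_few_shot_examples.json") as f:
    loaded_examples = json.load(f)
optimized_chain = create_chain(loaded_examples)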
