Few-shot prompting with LangSmith datasets
Datasets are useful for more than just testing, evaluation, and fine-tuning. They can also be used to curate examples for few-shot "learning." The overall flow for this looks something like:
- Capture samples by tracing a chain or LLM, either in a live deployment or on example data.
- Filter samples based on user feedback or other custom criteria (did/didn't error, AI-assisted evaluation, etc.)
- Create dataset(s) from filtered samples.
- Use within ExampleSelectors and FewShotPromptTemplates to improve in-context learning.
- Repeat :)
This can help improve the quality of the model, permit usage of cheaper models, and more.
As a motivating example, we will show how we can use a few examples from a chain powered by OpenAI's gpt-4 model to improve the quality of a "history Q&A" bot powered by the local GPT4All model.
Setup
First, we will need to install LangChain (plus the OpenAI and community integration packages), GPT4All, and Chroma, which is used later for the example selector's vector store.
pip install -U langchain langchain-openai langchain-community langsmith gpt4all chromadb
Then configure your environment:
export LANGCHAIN_API_KEY=<your key>
export LANGCHAIN_TRACING_V2=true
export OPENAI_API_KEY=<your key>
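If you're working from a notebook instead of a shell, you can set the same variables from Python (a minimal sketch; getpass is just one way to avoid hard-coding your keys):
import getpass
import os

# Equivalent to the shell exports above.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangSmith API key: ")
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")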
Step 1: Trace Prototype Model
Run your chain prototype to collect example traces. This data could come from a live deployment, staging environment, prototyping dataset, or any other source. If you already have some traced data you believe to be good candidates for few-shot prompting, you can skip this step. To get started quickly, we will use some example questions we want to ask our chain.
Create a dataset
The 'History of Flanders' dataset here is used mainly for convenience. We use it with the evaluate method to quickly capture traces. You can also select runs from any project for the few-shot dataset.
from langsmith import Client
client = Client()
questions = [
"When and where was the printing press invented by Johannes Gutenberg?",
"What were the most significant contributions of Flemish painters of the Renaissance?",
"What are the main characteristics of the Flemish language?",
"What were the main causes and consequences of the Flemish Revolt?",
"What are some of the most important works of Flemish literature in the 19th and 20th centuries?",
"What are some of the most important films produced in Flanders in the 20th century?",
"What were the key goals of the Flemish Movement?",
"What are the main responsibilities of the Flemish Community?",
"What are the main drivers of the Flemish economy in the 21st century?",
"What are the main arguments for and against Flemish independence?",
"What are the main challenges facing Flanders in the 21st century?",
"How did the invention of the printing press by Johannes Gutenberg impact Flemish history?",
"How did the development of Flemish painting in the 15th and 16th centuries impact Flemish history?",
"How did the development of the Flemish language impact Flemish history?",
"How did the Flemish Revolt (1568-1609) impact Flemish history?",
"How did the development of Flemish literature in the 19th and 20th centuries impact Flemish history?",
"How did the development of Flemish cinema in the 20th century impact Flemish history?",
"How did the development of the Flemish Movement (19th-20th centuries) impact Flemish history?",
"How did the creation of the Flemish Community (1970) impact Flemish history?",
"How did the development of the Flemish economy in the 21st century impact Flemish history?",
"How did the rise of Flemish nationalism in the 21st century impact Flemish history?",
]
shared_dataset_name = "History of Flanders"
ds = client.create_dataset(
    dataset_name=shared_dataset_name,
    description="Some questions about Flanders",
)
client.create_examples(
    inputs=[{"input": q} for q in questions],
    dataset_id=ds.id,
)
Run chain over dataset
We will use a gpt-4-powered chain and the evaluate method to kick things off.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langsmith.evaluation import evaluate
def prototype_predict(inputs: dict):
    llm = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0.0)
    prompt = PromptTemplate.from_template(
        template="Help out as best you can.\nQuestion: {input}\nResponse: ",
    )
    chain = prompt | llm
    # Return a dict so the traced run (and any example created from it)
    # has a consistent "output" key holding the response text.
    return {"output": chain.invoke({"input": inputs["input"]}).content}
prototype_project_name = "History Prototype Test"
prototype_results = evaluate(
    prototype_predict,
    data=shared_dataset_name,
    experiment_prefix=prototype_project_name,
)
Step 2: Create Few-Shot Dataset
The traces you've captured can be used for few-shot example prompting! Below, add the inputs and outputs from the traced runs to a dataset. You can review the data in the web app to delete or edit examples you don't like. You could also add or filter by feedback to make sure the dataset(s) capture the style you want to use. For instance, you may want to put good and bad examples in separate datasets and include examples of both in your final prompt.
few_shot_dataset_name = "History Few Shot Dataset"
few_shot_dataset = client.create_dataset(few_shot_dataset_name)
runs = client.list_runs(
    project_name=prototype_results.experiment_name,
    run_type="chain",
    is_root=True,  # only the top-level run per example, not nested chain runs
)
for run in runs:
    client.create_example_from_run(run, dataset_id=few_shot_dataset.id)
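If you've logged feedback on your traced runs (from users or AI-assisted evaluators), you can narrow the candidates before adding them to the dataset. The sketch below assumes a feedback key named "user_score" has been recorded; the exact filter string depends on the feedback you log (see the LangSmith run-filtering docs):
# Hypothetical filter: keep only root runs whose "user_score" feedback equals 1.
good_runs = client.list_runs(
    project_name=prototype_results.experiment_name,
    is_root=True,
    filter='and(eq(feedback_key, "user_score"), eq(feedback_score, 1))',
)
for run in good_runs:
    client.create_example_from_run(run, dataset_id=few_shot_dataset.id)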
Step 3: Establish Baseline
We're about ready to try out our few-shot prompting model. But first, we want to establish a baseline. Otherwise, we won't be able to tell if it actually helped! Let's take Nomic's GPT4All model to power a private, local history QA bot. To see how well it does on its own, we will create a "development dataset" and evaluate its responses on it. Once we have a baseline, we can try changing the prompts, models, or other parameters to get better results.
dev_questions = [
"What was the significance of the adoption of Christianity in Armenia in 301 AD?",
"Who was responsible for the creation of the Armenian Alphabet in 405 AD and why was it significant?",
"Can you describe some of the major accomplishments during the Golden Age of Armenian Art and Literature?",
"What is unique about the architecture of the Zvartnots Cathedral in Armenia?",
"Who wrote 'The Knight in the Panther's Skin' and why is it considered a masterpiece of Georgian literature?",
"What cultural developments took place in Georgia during the rule of Queen Tamar?",
"Can you describe the style and significance of Armenian Miniature Painting developed in the 13th to 14th centuries?",
"What are some key characteristics of the Georgian Renaissance in terms of architecture and arts?",
"What makes the Svetitskhoveli Cathedral an iconic symbol of Georgian architectural style?",
"How did the Mkhitarist Order contribute to the preservation of Armenian culture and literature?",
"How did the creation of the Armenian alphabet impact the cultural and literary development of Armenia?",
"What factors contributed to the Golden Age of Armenian Art and Literature?",
"What was the cultural and religious significance of the Zvartnots Cathedral in Armenia?",
"How did 'The Knight in the Panther's Skin' reflect Georgian cultural values and beliefs?",
"What kind of cultural and artistic advancements were made during the rule of Queen Tamar in Georgia?",
"What is the significance of Armenian Miniature Painting in the context of medieval art?",
"How did the Georgian Renaissance influence subsequent architectural and artistic styles in Georgia?",
"What are some of the key architectural features of the Svetitskhoveli Cathedral in Georgia?",
"How did the establishment of the Mkhitarist Order impact Armenian diaspora communities in Europe?",
"How did the religious and cultural changes in Armenia and Georgia from the 4th to the 18th centuries influence their respective art and architecture?"
]
dev_dataset_name = "History Dev Set"
dev_dataset = client.create_dataset(
    dataset_name=dev_dataset_name,
    description="Some history questions.",
)
client.create_examples(inputs=[{"input": q} for q in dev_questions], dataset_id=dev_dataset.id)
Define the chain to benchmark
from langchain_community.llms import GPT4All
from langchain_core.prompts import PromptTemplate
gpt4all_model = GPT4All(model="orca-mini-3b.ggmlv3.q4_0.bin")
base_prompt = PromptTemplate.from_template(
    template="Help out as best you can.\nQuestion: {input}\nResponse: ",
)

def benchmark_predict(inputs: dict):
    # GPT4All is a plain LLM, so the chain output is already a string.
    return {"output": (base_prompt | gpt4all_model).invoke(inputs)}
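Before running the full evaluation, it can help to sanity-check the local model on a single question (the question below is just an illustration):
print(benchmark_predict({"input": "Who wrote 'The Knight in the Panther's Skin'?"})["output"])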
Run the chain
Run the chain on the dev dataset and inspect the results. We define some AI-assisted evaluators to give us visibility into the quality of the responses.
from langsmith.evaluation import LangChainStringEvaluator
from langsmith.schemas import Example, Run
def prepare_criteria_data(run: Run, example: Example):
    return {
        "prediction": run.outputs["output"],
        "input": example.inputs["input"],
    }

# The dev dataset has no reference answers, so we use reference-free
# criteria evaluators.
helpfulness_evaluator = LangChainStringEvaluator(
    "criteria",
    config={"criteria": "helpfulness"},
    prepare_data=prepare_criteria_data,
)
completeness_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "completeness": "Is the submission complete, providing adequate depth in its answer?"
        }
    },
    prepare_data=prepare_criteria_data,
)
original_dev_res = evaluate(
    benchmark_predict,
    data=dev_dataset_name,
    evaluators=[helpfulness_evaluator, completeness_evaluator],
    experiment_prefix="zero-shot chain test",
)
You can view some of the evaluation feedback by inspecting the project.
# More details are visible in the web app, but you can see feedback stats
# directly using the SDK
original_dev_res.aggregate_feedback
Step 4: Create Few-Shot Example Model
That did reasonably well, but we think we can do better. We liked the style of response provided by the prototype model and want to achieve a similar style with the local model. We will accomplish this with a few-shot example selector. In the example below, we fetch the examples from the few-shot dataset created previously and add them to a vector store.
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
# Pull the curated examples back out of the few-shot dataset.
examples = [
    {"input": example.inputs["input"], "output": example.outputs["output"]}
    for example in client.list_examples(dataset_name=few_shot_dataset_name)
]
# Embed the concatenated question/answer text for similarity search,
# keeping the structured example as metadata.
to_vectorize = [" ".join(example.values()) for example in examples]
vectorstore = Chroma.from_texts(to_vectorize, embeddings, metadatas=examples)
example_selector = SemanticSimilarityExampleSelector(vectorstore=vectorstore)
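To sanity-check the selector before wiring it into a prompt, you can ask it directly which stored examples it would retrieve for a question (the question below is just an illustration):
# Returns the metadata dicts of the most similar stored examples.
selected = example_selector.select_examples({"input": "What drives the Flemish economy today?"})
for ex in selected:
    print(ex["input"])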
The FewShotPromptTemplate is used to inject examples into the prompt using the example_selector created above. In this case, we used a SemanticSimilarityExampleSelector, which returns examples that are the most 'similar' based on vector similarity. The overall flow looks something like:
- An input is passed to the chain.
- The input is passed from the FewShotPromptTemplate to the example_selector, which returns the top N most 'similar' examples.
- The example_prompt is used to format each selected example.
- The selected examples are inserted between the prefix and suffix to form the final prompt.
from langchain_core.prompts import FewShotPromptTemplate, PromptTemplate
# Define how each selected example will be formatted
# when inserted into the whole prompt
example_prompt = PromptTemplate(
    input_variables=["input", "output"],
    template="Question: {input}\nResponse: {output}\n\n",
)
# Define the overall prompt.
few_shot_prompt = FewShotPromptTemplate(
    example_selector=example_selector,
    example_prompt=example_prompt,
    prefix="Help out as best you can.\n",
    suffix="Question: {input}\nResponse: ",
    input_variables=["input"],
)
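To see exactly what the local model will receive, you can format the few-shot prompt for a sample question; the selected examples are rendered between the prefix and suffix:
print(few_shot_prompt.format(input="What were the key goals of the Flemish Movement?"))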
Benchmark the few-shot model
Now we can run the chain again, this time using the few-shot prompt template. We will use the same evaluators as before to compare the results.
def fewshot_predict(inputs: dict):
    return {"output": (few_shot_prompt | gpt4all_model).invoke(inputs)}
few_shot_dev_res = evaluate(
    fewshot_predict,
    data=dev_dataset_name,
    evaluators=[helpfulness_evaluator, completeness_evaluator],
    experiment_prefix="few-shot chain test",
)
Compare results
Both the helpfulness and completeness of the responses improved!
few_shot_dev_res.aggregate_feedback
Review
In this example, we used LangSmith datasets to curate examples and connected them to a few-shot example selector to improve the quality of the prompt used with a smaller local model. This tactic can be extended in many ways to help improve the quality, style, API awareness, and other characteristics of your chain or agent without having to fine-tune the underlying LLM weights.