Working with feedback
This guide will walk you through feedback in LangSmith. For more end-to-end examples incorporating feedback into a workflow, see the LangSmith Cookbook.
Feedback comes in two forms: automated and human. Both are key to delivering a consistently high quality application experience.
Automated feedback is generated any time you use the run_on_dataset method to test your component on a dataset, or any time you call client.evaluate_run(run_id, run_evaluator). This guide will focus on human feedback using a common workflow.
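For completeness, here is a minimal sketch of the automated path using client.evaluate_run. The ContainsGreeting evaluator below is a hypothetical example written for this sketch, not a LangSmith built-in; any RunEvaluator subclass works the same way.
from langsmith import Client
from langsmith.evaluation import EvaluationResult, RunEvaluator

class ContainsGreeting(RunEvaluator):
    """Hypothetical evaluator: scores 1 if the run output mentions a greeting."""

    def evaluate_run(self, run, example=None) -> EvaluationResult:
        text = str((run.outputs or {}).get("output", ""))
        return EvaluationResult(key="contains_greeting", score=int("hi" in text.lower()))

client = Client()
# Replace "<run_id>" with the ID of a traced run from your project
client.evaluate_run("<run_id>", ContainsGreeting())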
There are two ways to log human feedback:
- Clicking "Rate Run" on a run's trace in the web app directly
- Using the LangSmith client (or REST API).
Below is the method signature from the client.
- Python
- TypeScript
from langsmith import Client
client = Client()
client.create_feedback(
"<run_id>",
"<feedback_key>",
*,
score: float | int | bool | None = None,
value: float | int | bool | str | dict | None = None,
correction: str | dict | None = None,
comment: str | None = None,
source_info: Dict[str, Any] | None = None,
feedback_source_type: "API" | "MODEL" = "API",
)
import { Client } from "langsmith";
const client = new Client();
await client.createFeedback("<run_id>", "<feedbackKey>", {
  score?: number | boolean,
  value?: number | boolean | string | object,
  correction?: string | object,
  comment?: string,
  sourceInfo?: object,
  feedbackSourceType?: "API" | "MODEL"
});
Feedback requires a "key" as a name for the feedback and a run_id to assign the feedback to. Typically, you will also want a "score" to quantify the feedback and facilitate comparisons. Feedback can contain any of the following optional fields (an example call populating them follows this list):
- score: The score to rate this run on the metric or aspect.
- value: The display value or non-numeric value for this feedback.
- correction: The correct ground truth for this run.
- comment: A comment about this feedback, or additional reasoning about why it received this score.
- source_info: Information about the source of this feedback (e.g., a user ID, model type, tags, etc.)
- feedback_source_type: The type of feedback source, either API or MODEL.
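For example, here is a sketch of a create_feedback call that populates these fields. The run ID, feedback key, and field values are placeholders to replace with your own.
from langsmith import Client

client = Client()
client.create_feedback(
    "<run_id>",          # the run to attach this feedback to
    "correctness",       # feedback key naming the metric
    score=0.0,           # rating on the metric (numeric or boolean)
    value="incorrect",   # display value for non-numeric feedback
    correction={"generation": "The capital of France is Paris."},  # ground truth
    comment="The bot named the wrong city.",  # reasoning behind the score
    source_info={"user_id": "user-1234"},     # metadata about the feedback source
)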
Example workflow
In addition to general production observability, user feedback is a key ingredient to continually improving your LLM application. A simple workflow for this is:
- Enable tracing and use the LangSmith client to save user feedback.
- Filter runs based on feedback and other automated metrics.
- Export runs to a dataset to use for evaluation and training.
1. Log feedback for a run
Assuming you've already set up tracing as described in the LangSmith quick start, you can add user feedback to a run by capturing the run ID and including it in the create_feedback call.
To access the run ID for a given call, use the RunCollectorCallbackHandler (or, in Python, the collect_runs() context manager). Alternatively, in Python you can pass include_run_info=True when calling your chain to return the run information along with the outputs. Below is an example:
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
export LANGCHAIN_API_KEY=<your_api_key>
export LANGCHAIN_PROJECT=<your_project>
- Python
- TypeScript
import os
from langchain import chat_models, prompts, callbacks
from langchain.schema import output_parser
from langsmith import Client
chain = (
    prompts.ChatPromptTemplate.from_messages(
        [
            ("system", "You are a conversational bot."),
            ("user", "{input}"),
        ]
    )
    | chat_models.ChatOpenAI()
    | output_parser.StrOutputParser()
)
client = Client()
def main():
    with callbacks.collect_runs() as cb:
        for tok in chain.stream({"input": "Hi, I'm Clara"}):
            print(tok, end="", flush=True)
        run_id = cb.traced_runs[0].id
    # ... User copies the generated response
    client.create_feedback(run_id, "did_copy", score=True)
    # ... User clicks a thumbs up button
    client.create_feedback(run_id, "thumbs_up", score=True)

main()
import { ChatOpenAI } from "langchain/chat_models/openai";
import { LLMChain } from "langchain/chains";
import { PromptTemplate } from "langchain/prompts";
// Supported in versions >= 0.0.139
import { RunCollectorCallbackHandler } from "langchain/callbacks";
import { Client } from "langsmith";
const client = new Client();
const prompt = PromptTemplate.fromTemplate("Say hi to {name}");
const llm = new ChatOpenAI();
const chain = prompt.pipe(llm);
const runCollector = new RunCollectorCallbackHandler();
async function main() {
  const response = await chain.invoke({ name: "Clara" }, { callbacks: [runCollector] });
  const runId = runCollector.tracedRuns[0].id;
  // ... User copies the generated response
  await client.createFeedback(runId, "did_copy", { score: true });
  // ... User clicks a thumbs up button
  await client.createFeedback(runId, "thumbs_up", { score: true });
}
main();
2. Filter runs based on feedback
Once you've collected runs, you can select some to analyze or export to a dataset. It's easiest to do this in the LangSmith web app directly, but you can also use the SDK. Below, we select all runs that received a 'thumbs_up' from the user.
- Python
- TypeScript
runs = client.list_runs(
    project_name="<your_project>",
    filter='and(eq(feedback_key, "thumbs_up"), eq(feedback_score, 1.0))',
)
import { Run } from "langsmith/schemas";

const runs: Run[] = [];
for await (const run of client.listRuns({
  projectName: "<your_project>",
  filter: 'and(eq(feedback_key, "thumbs_up"), eq(feedback_score, 1.0))',
})) {
  runs.push(run);
}
3. Export runs to dataset
After reviewing the runs or filtering on other metrics, you can export them to a dataset for further analysis, evaluation, or even for training new models. The easiest way to do so is in the web app: go to the project, select runs, and click "Add to dataset".
- Python
- TypeScript
dataset = client.create_dataset(
    dataset_name="Thumbs Up Runs",
    description="Runs that received a thumbs up from the user",
)
for run in runs:
    client.create_example_from_run(
        run=run,
        dataset_id=dataset.id,
    )
const dataset = await client.createDataset("Thumbs Up Runs", {
  description: "Runs that received a thumbs up from the user",
});
for (const run of runs) {
  await client.createExample(run.inputs, run.outputs ?? {}, {
    datasetId: dataset.id,
  });
}
Once you've created a dataset, you can use it to evaluate new prompts against, to train a new LLM, or for other purposes.
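For example, here is a minimal sketch of evaluating the chain from step 1 against the new dataset, assuming LangChain's run_on_dataset helper with an LLM-based "helpfulness" criteria evaluator; swap in the evaluators and chain factory that fit your application.
from langchain import chat_models, prompts
from langchain.schema import output_parser
from langchain.smith import RunEvalConfig, run_on_dataset
from langsmith import Client

def chain_factory():
    # Build a fresh copy of the chain from step 1 for each dataset example
    return (
        prompts.ChatPromptTemplate.from_messages(
            [
                ("system", "You are a conversational bot."),
                ("user", "{input}"),
            ]
        )
        | chat_models.ChatOpenAI()
        | output_parser.StrOutputParser()
    )

client = Client()
run_on_dataset(
    client=client,
    dataset_name="Thumbs Up Runs",
    llm_or_chain_factory=chain_factory,
    evaluation=RunEvalConfig(evaluators=[RunEvalConfig.Criteria(criteria="helpfulness")]),
)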