How to manage datasets programmatically
You can use the Python and TypeScript SDKs to manage datasets programmatically. This includes creating, updating, and deleting datasets, as well as adding examples to them.
Create a dataset
Create a dataset from a list of values
The most flexible way to make a dataset using the client is by creating examples from a list of inputs and optional outputs. Below is an example.
Note that you can add arbitrary metadata to each example, such as a note or a source. The metadata is stored as a dictionary.
If you have many examples to create, consider using the create_examples/createExamples method to create multiple examples in a single request.
If creating a single example, you can use the create_example/createExample method.
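For the single-example case, a minimal Python sketch might look like the following. The helper name add_single_example and the "source" metadata key are illustrative assumptions, not part of the SDK; the only SDK call used is create_example, and the client is passed in so the helper can be exercised without a network connection.

```python
from typing import Any, Optional


def add_single_example(
    client: Any,
    dataset_id: Any,
    question: str,
    answer: Optional[str] = None,
    source: Optional[str] = None,
) -> Any:
    """Add one question/answer pair to an existing dataset.

    `client` is expected to be a langsmith Client (or any object that
    exposes a compatible create_example method).
    """
    # Outputs are optional, so only send them when an answer is given.
    outputs = {"answer": answer} if answer is not None else None
    # Metadata is a plain dictionary; "source" is just an example key.
    metadata = {"source": source} if source is not None else None
    return client.create_example(
        inputs={"question": question},
        outputs=outputs,
        metadata=metadata,
        dataset_id=dataset_id,
    )
```

With a real client, this would be called as add_single_example(Client(), dataset.id, "What is the largest mammal?", "The blue whale", source="Wikipedia").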
- Python
- TypeScript
from langsmith import Client
example_inputs = [
("What is the largest mammal?", "The blue whale"),
("What do mammals and birds have in common?", "They are both warm-blooded"),
("What are reptiles known for?", "Having scales"),
("What's the main characteristic of amphibians?", "They live both in water and on land"),
]
client = Client()
dataset_name = "Elementary Animal Questions"
# Storing inputs in a dataset lets us
# run chains and LLMs over a shared set of examples.
dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Questions and answers about animal phylogenetics.",
)
# Prepare inputs, outputs, and metadata for bulk creation
inputs = [{"question": input_prompt} for input_prompt, _ in example_inputs]
outputs = [{"answer": output_answer} for _, output_answer in example_inputs]
metadata = [{"source": "Wikipedia"} for _ in example_inputs]
client.create_examples(
inputs=inputs,
outputs=outputs,
metadata=metadata,
dataset_id=dataset.id,
)
import { Client } from "langsmith";
const client = new Client();
const exampleInputs: [string, string][] = [
["What is the largest mammal?", "The blue whale"],
["What do mammals and birds have in common?", "They are both warm-blooded"],
["What are reptiles known for?", "Having scales"],
[
"What's the main characteristic of amphibians?",
"They live both in water and on land",
],
];
const datasetName = "Elementary Animal Questions";
// Storing inputs in a dataset lets us
// run chains and LLMs over a shared set of examples.
const dataset = await client.createDataset(datasetName, {
description: "Questions and answers about animal phylogenetics",
});
// Prepare inputs, outputs, and metadata for bulk creation
const inputs = exampleInputs.map(([inputPrompt]) => ({ question: inputPrompt }));
const outputs = exampleInputs.map(([, outputAnswer]) => ({ answer: outputAnswer }));
const metadata = exampleInputs.map(() => ({ source: "Wikipedia" }));
// Use the bulk createExamples method
await client.createExamples({
inputs,
outputs,
metadata,
datasetId: dataset.id,
});
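If the list of examples is very large, you may prefer to send it in several bulk requests rather than one. The Python sketch below is one way to do that under stated assumptions: the helper name create_examples_in_batches and the default batch size of 100 are arbitrary illustrative choices, not documented limits, and the only SDK call used is create_examples.

```python
from typing import Any, Dict, List


def create_examples_in_batches(
    client: Any,
    dataset_id: Any,
    inputs: List[Dict[str, Any]],
    outputs: List[Dict[str, Any]],
    metadata: List[Dict[str, Any]],
    batch_size: int = 100,  # arbitrary illustrative value
) -> int:
    """Upload examples in fixed-size slices; return the number of requests made."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if not (len(inputs) == len(outputs) == len(metadata)):
        raise ValueError("inputs, outputs, and metadata must have the same length")

    requests_made = 0
    for start in range(0, len(inputs), batch_size):
        end = start + batch_size
        # Each call creates one slice of examples in a single request.
        client.create_examples(
            inputs=inputs[start:end],
            outputs=outputs[start:end],
            metadata=metadata[start:end],
            dataset_id=dataset_id,
        )
        requests_made += 1
    return requests_made
```

Passing the client as an argument keeps the helper easy to test with a stand-in object; in real use you would pass the Client() instance and dataset.id from the example above.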
Create a dataset from traces
To create datasets from the runs (spans) of your traces, you can use the same approach. For many more examples of how to fetch and filter runs, see the export traces guide. Below is an example:
- Python
- TypeScript
from langsmith import Client
client = Client()
dataset_name = "Example Dataset"
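# Filter runs to add to the dataset.
# list_runs returns an iterator, so materialize it into a list
# before iterating over it more than once below.
runs = list(
    client.list_runs(
        project_name="my_project",
        is_root=True,
        error=False,
    )
)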
dataset = client.create_dataset(dataset_name, description="An example dataset")
# Prepare inputs and outputs for bulk creation
inputs = [run.inputs for run in runs]
outputs = [run.outputs for run in runs]
# Use the bulk create_examples method
client.create_examples(
inputs=inputs,
outputs=outputs,
dataset_id=dataset.id,
)
import { Client, Run } from "langsmith";
const client = new Client();
const datasetName = "Example Dataset";
// Filter runs to add to the dataset
const runs: Run[] = [];
for await (const run of client.listRuns({
projectName: "my_project",
isRoot: true,
error: false,
})) {
runs.push(run);
}
const dataset = await client.createDataset(datasetName, {
description: "An example dataset",
dataType: "kv",
});
// Prepare inputs and outputs for bulk creation
const inputs = runs.map(run => run.inputs);
const outputs = runs.map(run => run.outputs ?? {});
// Use the bulk createExamples method
await client.createExamples({
inputs,
outputs,
datasetId: dataset.id,
});
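Some runs may have no recorded outputs. The Python sketch below collects inputs and outputs in a single pass and skips such runs; whether to skip them or keep them with empty outputs is a design choice, and the helper name runs_to_examples is hypothetical. The SimpleNamespace objects stand in for real runs purely for illustration.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List, Tuple


def runs_to_examples(
    runs: Iterable[Any],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Collect (inputs, outputs) from runs, skipping runs without outputs."""
    inputs: List[Dict[str, Any]] = []
    outputs: List[Dict[str, Any]] = []
    # A single loop means this also works when `runs` is a one-shot iterator.
    for run in runs:
        if not run.outputs:
            continue  # no outputs recorded for this run
        inputs.append(run.inputs)
        outputs.append(run.outputs)
    return inputs, outputs


# Stand-in objects with the same attributes as a run, for illustration only.
sample_runs = [
    SimpleNamespace(inputs={"question": "q1"}, outputs={"answer": "a1"}),
    SimpleNamespace(inputs={"question": "q2"}, outputs=None),
]
sample_inputs, sample_outputs = runs_to_examples(iter(sample_runs))
```

The two returned lists can then be passed to create_examples as inputs and outputs, as in the example above.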
Create a dataset from a CSV file
This section shows how to create a dataset by uploading a CSV file.
First, ensure your CSV file is properly formatted, with columns that represent your input and output keys. These keys are used to map your data during the upload. You can optionally specify a name and description for your dataset. Otherwise, the file name is used as the dataset name and no description is provided.
- Python
- TypeScript
from langsmith import Client
client = Client()
csv_file = 'path/to/your/csvfile.csv'
input_keys = ['column1', 'column2'] # replace with your input column names
output_keys = ['output1', 'output2'] # replace with your output column names
dataset = client.upload_csv(
csv_file=csv_file,
input_keys=input_keys,
output_keys=output_keys,
name="My CSV Dataset",
description="Dataset created from a CSV file",
data_type="kv"
)
import { Client } from "langsmith";
const client = new Client();
const csvFile = 'path/to/your/csvfile.csv';
const inputKeys = ['column1', 'column2']; // replace with your input column names
const outputKeys = ['output1', 'output2']; // replace with your output column names
const dataset = await client.uploadCsv({
csvFile: csvFile,
fileName: "My CSV Dataset",
inputKeys: inputKeys,
outputKeys: outputKeys,
description: "Dataset created from a CSV file",
dataType: "kv"
});
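Because the upload maps columns by name, it can help to check the CSV header locally before uploading. The following Python sketch uses only the standard library; the helper name missing_csv_columns is hypothetical, and it assumes the first row of the file is the header and that column names must match the keys exactly.

```python
import csv
import io
from typing import List, Sequence, TextIO


def missing_csv_columns(csv_text: TextIO, required_keys: Sequence[str]) -> List[str]:
    """Return the required keys that do not appear in the CSV header row."""
    reader = csv.reader(csv_text)
    # The first row is assumed to be the header; an empty file yields [].
    header = next(reader, [])
    present = set(header)
    return [key for key in required_keys if key not in present]


# In-memory sample standing in for a real file, for illustration only.
sample_csv = io.StringIO("column1,column2,output1\nA,B,C\n")
missing = missing_csv_columns(
    sample_csv, ["column1", "column2", "output1", "output2"]
)
```

For a file on disk, open it with open(csv_file, newline="") and pass the file object together with input_keys + output_keys; an empty result means every key has a matching column.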