Datasets are versioned collections of test cases that you use to run evaluations and track improvements over time. Build datasets from production logs, user feedback, or manual curation, or generate them with Loop. Key advantages:
  • Versioned: Every change is tracked, so experiments can pin to specific versions
  • Integrated: Use directly in evaluations and populate from production
  • Scalable: Stored in a modern data warehouse without storage limits

Dataset structure

Each record has three top-level fields:
  • input: Data to recreate the example in your application (required).
  • expected: Ideal output or ground truth (optional but recommended for evaluation).
  • metadata: Key-value pairs for filtering and grouping (optional).
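
For example, a record for a customer support assistant might look like the following (the field contents are illustrative):
{
  "input": { "question": "How do I reset my password?" },
  "expected": { "answer": "Click 'Forgot Password' on the login page." },
  "metadata": { "category": "authentication", "difficulty": "easy" }
}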

Create datasets

Upload CSV/JSON

The fastest way to create a dataset is uploading a CSV or JSON file:
  1. Go to Datasets.
  2. If there are existing datasets, click + Dataset. Otherwise, click Upload CSV/JSON.
  3. Drag and drop your file in the Upload dataset dialog.
  4. Columns automatically map to the input field. Drag and drop them into different categories as needed:
    • Input: Fields used as inputs for your task.
    • Expected: Ground truth or ideal outputs for scoring.
    • Metadata: Additional context for filtering and grouping.
    • Tags: Labels for organizing and filtering. When you categorize columns as tags, they’re automatically added to your project’s tag configuration.
    • Do not import: Exclude columns from the dataset.
    The preview table updates in real-time as you move columns between categories, showing exactly how your dataset will be structured.
  5. Click Import.
If your data includes an id field, rows with the same ID are deduplicated, keeping only the last occurrence of each ID.
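For instance, a small JSON upload like the following (field names are illustrative) maps cleanly onto this structure: question can stay under Input, answer can be dragged to Expected, category to Metadata, and the id field drives deduplication:
[
  { "id": "1", "question": "How do I reset my password?", "answer": "Click 'Forgot Password' on the login page.", "category": "authentication" },
  { "id": "2", "question": "What's your refund policy?", "answer": "Full refunds within 30 days of purchase.", "category": "billing" }
]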

Create via SDK

Create datasets programmatically and populate them with records. The approach varies by language:
  • TypeScript/Python: Use the high-level initDataset() / init_dataset() method which automatically creates datasets and provides simple insert() operations.
  • Go/Ruby: Use lower-level API methods that require initializing an API client and explicitly managing dataset creation and record insertion.
import { initDataset } from "braintrust";

async function main() {
  // Initialize dataset (creates it if it doesn't exist)
  const dataset = initDataset("My App", { dataset: "Customer Support" });

  // Insert records with input, expected output, and metadata
  dataset.insert({
    input: { question: "How do I reset my password?" },
    expected: { answer: "Click 'Forgot Password' on the login page." },
    metadata: { category: "authentication", difficulty: "easy" },
  });

  dataset.insert({
    input: { question: "What's your refund policy?" },
    expected: { answer: "Full refunds within 30 days of purchase." },
    metadata: { category: "billing", difficulty: "easy" },
  });

  dataset.insert({
    input: { question: "How do I integrate your API with NextJS?" },
    expected: { answer: "Install the SDK and use our React hooks." },
    metadata: { category: "technical", difficulty: "medium" },
  });

  // Flush to ensure all records are saved
  await dataset.flush();
  console.log("Dataset created with 3 records");
}

main();

Generate with Loop

Ask Loop to create a dataset based on your logs or specific criteria. Example queries:
  • “Generate a dataset from the highest-scoring examples in this experiment”
  • “Create a dataset with the most common inputs in the logs”

From user feedback

User feedback from production provides valuable test cases that reflect real user interactions. Use feedback to create datasets from highly-rated examples or problematic cases. See Capture user feedback for implementation details on logging feedback programmatically. To build datasets from feedback:
  1. Filter logs by feedback scores using the Filter menu:
    • scores.user_rating > 0.8 (SQL) or filter: scores.user_rating > 0.8 (BTQL) for highly-rated examples
    • metadata.thumbs_up = false for negative feedback
    • comment IS NOT NULL and scores.correctness < 0.5 for low-scoring feedback with comments
  2. Select the traces you want to include.
  3. Select Add to dataset.
  4. Choose an existing dataset or create a new one.
You can also ask Loop to create datasets based on feedback patterns, such as “Create a dataset from logs with positive feedback” or “Build a dataset from cases where users clicked thumbs down.”

Log from production

Track user feedback from your application:
import { initDataset, Dataset } from "braintrust";

class MyApplication {
  private dataset: Dataset | undefined = undefined;

  async initApp() {
    this.dataset = await initDataset("My App", { dataset: "logs" });
  }

  async logUserExample(
    input: any,
    expected: any,
    userId: string,
    thumbsUp: boolean,
  ) {
    if (this.dataset) {
      this.dataset.insert({
        input,
        expected,
        metadata: { userId, thumbsUp },
      });
    }
  }
}

Manage datasets

From the dataset page, you can:
  • Filter and search records
  • Create custom columns to extract nested values
  • Edit records inline
  • Copy records between datasets
  • Delete individual records or entire datasets

Filter records

Read and filter datasets using _internal_btql to control which records are returned:
import { initDataset } from "braintrust";

// Read all records
const dataset = initDataset("My App", { dataset: "Customer Support" });

for await (const row of dataset) {
  console.log(row);
}

// Filter by metadata
const premiumDataset = initDataset("My App", {
  dataset: "Customer Support",
  _internal_btql: {
    filter: { btql: "metadata.category = 'premium'" },
    limit: 100,
  },
});

for await (const row of premiumDataset) {
  console.log(row);
}

// Sort by creation date
const sortedDataset = initDataset("My App", {
  dataset: "Customer Support",
  _internal_btql: {
    sort: [{ expr: { btql: "created" }, dir: "desc" }],
    limit: 50,
  },
});

// Combine filters and sorts
const recentSupport = initDataset("My App", {
  dataset: "Customer Support",
  _internal_btql: {
    filter: {
      btql: "metadata.category = 'support' and created > now() - interval 7 day",
    },
    sort: [{ expr: { btql: "created" }, dir: "desc" }],
    limit: 1000,
  },
});
For more information on BTQL syntax and available operators, see the BTQL reference documentation.

Update records

Update existing records by id:
import { initDataset } from "braintrust";

const dataset = initDataset("My App", { dataset: "Customer Support" });

// Insert a record
const id = dataset.insert({
  input: { question: "How do I reset my password?" },
  expected: { answer: "Click 'Forgot Password' on the login page." },
});

// Update the record
dataset.update({
  id,
  metadata: { reviewed: true, difficulty: "easy" },
});

await dataset.flush();
The update() method applies a merge strategy: only the fields you provide will be updated, and all other existing fields in the record will remain unchanged.

Delete records

Remove records programmatically by id:
import { initDataset } from "braintrust";

const dataset = initDataset("My App", { dataset: "Customer Support" });

// Insert a record
const id = dataset.insert({
  input: { question: "Test question" },
  expected: { answer: "Test answer" },
});

// Delete the record
await dataset.delete(id);
await dataset.flush();
To delete an entire dataset, use the UI or the API.
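As a rough sketch, deleting an entire dataset through the API could look like the following; the v1 dataset endpoint path and the DATASET_ID environment variable are assumptions here, so check the API reference for the exact route:
// Hedged sketch: delete a whole dataset via the REST API.
// Assumes a DELETE /v1/dataset/{id} endpoint and a dataset ID you already know.
const response = await fetch(
  `https://api.braintrust.dev/v1/dataset/${process.env.DATASET_ID}`,
  {
    method: "DELETE",
    headers: { Authorization: `Bearer ${process.env.BRAINTRUST_API_KEY}` },
  },
);
console.log(response.status);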

Flush records

The Braintrust SDK flushes records asynchronously and installs exit handlers, but these hooks are not always respected (e.g., by certain runtimes or when exiting a process abruptly). Call flush() to ensure records are written:
import { initDataset } from "braintrust";

const dataset = initDataset("My App", { dataset: "Customer Support" });

// Insert records
dataset.insert({
  input: { question: "How do I reset my password?" },
  expected: { answer: "Click 'Forgot Password' on the login page." },
});

// Flush to ensure all records are saved
await dataset.flush();

Create custom columns

Extract values from records using custom columns. Use SQL expressions to surface important fields directly in the table.
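For example, with records shaped like the ones above, expressions such as these (field names are assumptions about your data) surface nested values as their own columns:
-- Show the nested question text and category as separate columns
input.question
metadata.category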

Use in evaluations

Use datasets as the data source for evaluations. You can pass datasets directly or convert experiment results into dataset format.

Pass datasets directly

Pass datasets directly to Eval():
import { initDataset, Eval } from "braintrust";
import { Levenshtein } from "autoevals";

Eval("Say Hi Bot", {
  data: initDataset("My App", { dataset: "My Dataset" }),
  task: async (input) => {
    return "Hi " + input;
  },
  scores: [Levenshtein],
});

Convert experiment results

Convert experiment results into dataset format using asDataset()/as_dataset(). This is useful for iterative improvement workflows where you want to use the results of one experiment as the baseline for future experiments:
import { init, Eval } from "braintrust";
import { Levenshtein } from "autoevals";

const experiment = init("My App", {
  experiment: "my-experiment",
  open: true,
});

Eval<string, string>("My App", {
  data: experiment.asDataset(),
  task: async (input) => {
    return `hello ${input}`;
  },
  scores: [Levenshtein],
});

Review datasets

You can configure human review workflows to label and evaluate dataset records with your team.

Configure review scores

Configure categorical scores so reviewers can rapidly label records, and optionally write review results to the expected field. See Configure review scores for details.

Assign rows for review

Assign dataset rows to team members for review, analysis, or follow-up action. Assignments are particularly useful for distributing review work across multiple team members. See Assign rows for review for details.

Define schemas

If you want to ensure all records have the same structure or make editing easier, define JSON schemas for your dataset fields. Schemas are particularly useful when multiple team members are manually adding records or when you need strict data validation. Dataset schemas enable:
  • Validation: Catch structural errors when adding or editing records.
  • Form-based editing: Edit records with intuitive forms instead of raw JSON.
  • Documentation: Make field expectations explicit for your team.
To define a schema:
  1. Go to your dataset.
  2. Click Field schemas in the toolbar.
  3. Select the field you want to define a schema for (input, expected, or metadata).
  4. Click Infer schema to automatically generate a schema from the first 100 records, or manually define your schema structure.
  5. Toggle Enforce to enable validation. When enabled:
    • New records must conform or show validation errors.
    • Existing non-conforming records display warnings.
    • Form editing validates input as you type.
Enforcement is UI-only and doesn’t affect SDK inserts or updates.
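For instance, a minimal JSON schema for the input field used in the examples above might look like this (adjust the properties to match your own records):
{
  "type": "object",
  "properties": {
    "question": { "type": "string" }
  },
  "required": ["question"]
}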

Track performance

Monitor how dataset rows perform across experiments.

View experiment runs

See all experiments that used a dataset:
  1. Go to your dataset page.
  2. In the right panel, select Runs.
  3. Review performance metrics across experiments.
Runs display as charts that show score trends over time. The time axis flows from oldest (left) to newest (right), making it easy to track performance evolution.

Filter experiment runs

To narrow down the list of experiment runs, you can filter by time range or with SQL.

Filter by time range: Click and drag across any region of the chart to select a time range. The table below updates to show only experiments in that range. To clear the filter, click clear. This helps you focus on specific periods, like recent experiments or historical baselines.

Filter with SQL: Select Filter and use the Basic tab for common filters, or switch to SQL to write more precise queries based on criteria like score thresholds, time ranges, or experiment names. Common filtering examples:
-- Filter by time range
WHERE created > '2024-01-01'

-- Filter by score threshold
WHERE scores.Accuracy > 0.8

-- Filter by experiment name pattern
WHERE name LIKE '%baseline%'

-- Combine multiple conditions
WHERE created > now() - interval 7 day
  AND scores.Factuality > 0.7
Filter states are persisted in the URL, allowing you to bookmark or share specific filtered views of experiment runs.

Analyze per-row performance

See how individual rows perform:
  1. Select a row in the dataset table.
  2. In the right panel, select Runs.
  3. Review the row’s metrics across experiments.
This view only shows experiments that set the origin field in eval traces.
Look for patterns:
  • Consistently low scores suggest ambiguous expectations.
  • Failures across experiments indicate edge cases.
  • High variance suggests instability.

Multimodal datasets

You can store and process images and other file types in your datasets. There are several ways to use files in Braintrust:
  • Image URLs (most performant) - Keep datasets lightweight with external image references.
  • Base64 (least performant) - Encode images directly in records.
  • Attachments (easiest to manage) - Store files directly in Braintrust.
  • External attachments - Reference files in your own object stores.
For large images, use image URLs to keep datasets lightweight. To keep all data within Braintrust, use attachments. Attachments support any file type including images, audio, and PDFs.
import { Attachment, initDataset } from "braintrust";
import path from "node:path";

async function createPdfDataset(): Promise<void> {
  const dataset = initDataset({
    project: "Project with PDFs",
    dataset: "My PDF Dataset",
  });
  for (const filename of ["example.pdf"]) {
    dataset.insert({
      input: {
        file: new Attachment({
          filename,
          contentType: "application/pdf",
          data: path.join("files", filename),
        }),
      },
    });
  }
  await dataset.flush();
}

createPdfDataset();
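
For the image URL approach, the record simply stores the reference as part of input. A minimal sketch, with an illustrative URL and field names:
import { initDataset } from "braintrust";

const imageDataset = initDataset("My App", { dataset: "Product Images" });

// Reference the image by URL to keep the dataset lightweight
imageDataset.insert({
  input: {
    question: "What product is shown in this image?",
    image_url: "https://example.com/images/product-123.png",
  },
  expected: { answer: "A stainless steel water bottle." },
});

await imageDataset.flush();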

Next steps