Run evaluations directly in your code with the Eval() function, run multiple evaluations from files with the braintrust eval CLI command, or create experiments in the Braintrust UI for no-code workflows. Integrate with CI/CD to catch regressions automatically.
For iterative experimentation, use playgrounds to test prompts and models interactively, compare results side-by-side, and then save winning configurations as experiments.

Run with Eval()

The Eval() function runs an evaluation and creates an experiment:
import { Eval, initDataset } from "braintrust";
import { Factuality } from "autoevals";

Eval("My Project", {
  experimentName: "My experiment",
  data: initDataset("My Project", { dataset: "My dataset" }),
  task: async (input) => {
    // Your LLM call here
    return await callModel(input);
  },
  scores: [Factuality],
  metadata: {
    model: "gpt-5-mini",
  },
});
Running Eval() automatically:
  • Creates an experiment in Braintrust
  • Displays a summary in your terminal
  • Populates the UI with results
  • Returns summary metrics
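For example, you can await the returned promise and inspect the summary programmatically, e.g. to gate a CI step (a minimal sketch; the exact shape of the resolved value is defined by the SDK's types):
import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

async function main() {
  // Eval() resolves once the experiment finishes; the resolved value
  // carries the same summary metrics that are printed to the terminal.
  const result = await Eval("My Project", {
    data: () => [{ input: "hello", expected: "Hi hello" }],
    task: async (input) => "Hi " + input,
    scores: [Levenshtein],
  });
  console.log(result.summary); // Score averages, experiment URL, etc.
}

main();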
You can pass a parameters option to Eval() to make configuration values (like model choice, temperature, or prompts) editable in the playground without changing code. Define parameters inline or use loadParameters() to reference saved configurations. See Write parameters and Remote evaluations for details.
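For example, a model name and temperature can be declared as parameters and read inside the task (a minimal sketch, assuming zod-based parameter definitions and that parameters are exposed on the task's second argument; see Write parameters for the exact API):
import { Eval } from "braintrust";
import { Factuality } from "autoevals";
import { z } from "zod";

Eval("My Project", {
  data: () => [{ input: "What is the capital of France?", expected: "Paris" }],
  // These values become editable in the playground without code changes
  parameters: {
    model: z.string().default("gpt-5-mini").describe("Model to use"),
    temperature: z.number().default(0).describe("Sampling temperature"),
  },
  task: async (input, { parameters }) => {
    // Replace with your LLM call, using parameters.model and parameters.temperature
    return `(${parameters.model} @ ${parameters.temperature}) ${input}`;
  },
  scores: [Factuality],
});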

Run with CLI

Use the braintrust eval command to run evaluations from files:
npx braintrust eval basic.eval.ts
npx braintrust eval [file or directory] ...
The CLI loads environment variables from:
  • .env.development.local
  • .env.local
  • .env.development
  • .env
Use --watch to re-run evaluations automatically when files change:
npx braintrust eval --watch basic.eval.ts

Run in UI

Create and run experiments directly in the Braintrust UI without writing code:
  1. Navigate to Evaluations > Experiments.
  2. Click + Experiment or use the empty state form.
  3. Select one or more prompts, workflows, or scorers to evaluate.
  4. Choose or create a dataset:
    • Select existing dataset: Pick from datasets in your organization
    • Upload CSV/JSON: Import test cases from a file
    • Empty dataset: Create a blank dataset to populate manually later
  5. Add scorers to measure output quality.
  6. Click Create to execute the experiment.
This workflow is ideal when you have prompts ready and want to quickly run experiments against datasets.
For iterative experimentation, use playgrounds to test prompts and models interactively, compare results side-by-side, and save winning configurations as experiments.
UI experiments time out after 15 minutes. For longer-running evaluations, use the SDK or CLI approach.

Run in CI/CD

Integrate evaluations into your CI/CD pipeline to catch regressions automatically.

GitHub Actions

Use the braintrustdata/eval-action to run evaluations on every pull request:
- name: Run Evals
  uses: braintrustdata/eval-action@v1
  with:
    api_key: ${{ secrets.BRAINTRUST_API_KEY }}
    runtime: node
The action automatically posts a comment with the results.
Full example workflow:
name: Run evaluations

on:
  pull_request:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '18'

      - name: Install dependencies
        run: npm install

      - name: Run Evals
        uses: braintrustdata/eval-action@v1
        with:
          api_key: ${{ secrets.BRAINTRUST_API_KEY }}
          runtime: node

Other CI systems

For other CI systems, run evaluations as a standard command:
# Install dependencies
npm install

# Run evaluations
npx braintrust eval evals/
Ensure your CI environment has the BRAINTRUST_API_KEY environment variable set.

Run remotely

Expose evaluations running on remote servers or local machines using dev mode:
npx braintrust eval --dev basic.eval.ts
This allows you to trigger evaluations from playgrounds and experiments. See Run remote evaluations for details.

Run locally

Run evaluations without sending logs to Braintrust for quick iteration:
npx braintrust eval --no-send-logs basic.eval.ts
braintrust eval --no-send-logs eval_basic.py

Configure experiments

Customize experiment behavior with options:
Eval("My Project", {
  data: myDataset,
  task: myTask,
  scores: [Factuality],

  // Experiment name
  experimentName: "gpt-4o-experiment",

  // Metadata for filtering/analysis
  metadata: {
    model: "gpt-4o",
    prompt_version: "v2",
  },

  // Maximum concurrency
  maxConcurrency: 10,

  // Trial count for averaging
  trialCount: 3,
});

Run trials

Run each input multiple times to measure variance and get more robust scores. Braintrust intelligently aggregates results by bucketing test cases with the same input value:
Eval("My Project", {
  data: myDataset,
  task: myTask,
  scores: [Factuality],
  trialCount: 10, // Run each input 10 times
});

Use hill climbing

Sometimes you don’t have expected outputs and want to use a previous experiment as a baseline instead. Hill climbing enables iterative improvement by comparing new experiments to previous ones, which is especially useful when you lack a pre-existing benchmark. Braintrust supports hill climbing as a first-class concept, allowing you to use a previous experiment’s output field as the expected field for the current experiment. Autoevals includes scorers like Battle and Summary designed specifically for hill climbing.
To enable hill climbing, use BaseExperiment() in the data field:
import { Battle } from "autoevals";
import { Eval, BaseExperiment } from "braintrust";

Eval<string, string, string>(
  "Say Hi Bot", // Replace with your project name
  {
    data: BaseExperiment(),
    task: (input) => {
      return "Hi " + input; // Replace with your task function
    },
    scores: [Battle.partial({ instructions: "Which response said 'Hi'?" })],
  },
);
Braintrust automatically picks the best base experiment using git metadata if available or timestamps otherwise, then populates the expected field by merging the expected and output fields from the base experiment. If you set expected through the UI while reviewing results, it will be used as the expected field for the next experiment.

Use a specific experiment

To use a specific experiment as the base, pass the name field to BaseExperiment():
import { Battle } from "autoevals";
import { Eval, BaseExperiment } from "braintrust";

Eval<string, string, string>(
  "Say Hi Bot", // Replace with your project name
  {
    data: BaseExperiment({ name: "main-123" }),
    task: (input) => {
      return "Hi " + input; // Replace with your task function
    },
    scores: [Battle.partial({ instructions: "Which response said 'Hi'?" })],
  },
);

Scoring considerations

When hill climbing, use two types of scoring functions:
  • Non-comparative methods like ClosedQA that judge output quality based purely on input and output without requiring an expected value. Track these across experiments to compare any two experiments, even if they aren’t sequentially related.
  • Comparative methods like Battle or Summary that accept an expected output but don’t treat it as ground truth. If you score > 50% on a comparative method, you’re doing better than the base on average. Learn more about how Battle and Summary work.
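For example, you might attach both kinds of scorers to a hill-climbing eval (a sketch based on the Say Hi Bot example above; the assumption that ClosedQA accepts a criteria option via partial(), and the criteria text itself, are illustrative):
import { Battle, ClosedQA } from "autoevals";
import { Eval, BaseExperiment } from "braintrust";

Eval("Say Hi Bot", {
  data: BaseExperiment(),
  task: (input: string) => "Hi " + input,
  scores: [
    // Comparative: scores above 50% mean the new output beats the base on average
    Battle.partial({ instructions: "Which response said 'Hi'?" }),
    // Non-comparative: judges quality from input and output alone, so it stays
    // comparable across experiments that aren't sequentially related
    ClosedQA.partial({ criteria: "Does the response greet the user?" }),
  ],
});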

Create custom reporters

When you run an experiment, Braintrust logs results to your terminal, and braintrust eval returns a non-zero exit code if any eval throws an exception. Customize this behavior for CI/CD pipelines to precisely define what constitutes a failure or to report results to different systems. Define custom reporters using Reporter(). A reporter has two functions:
import { Reporter } from "braintrust";

Reporter(
  "My reporter", // Replace with your reporter name
  {
    reportEval(evaluator, result, opts) {
      // Summarizes the results of a single evaluator. Return whatever you
      // want (the full results, a piece of text, or both!)
    },

    reportRun(results) {
      // Takes all the results and summarizes them. Return true or false
      // to indicate whether the run succeeded, which determines the exit code.
      return true;
    },
  },
);
Any Reporter included among your evaluated files will be automatically picked up by the braintrust eval command.
  • If no reporters are defined, the default reporter logs results to the console.
  • If you define one reporter, it’s used for all Eval blocks.
  • If you define multiple Reporters, specify the reporter name as an optional third argument to Eval().
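For example, with multiple reporters defined, an Eval block can name the one it should use (a sketch; the data, task, and scorer are placeholders):
import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

Eval(
  "My Project",
  {
    data: () => [{ input: "hello", expected: "Hi hello" }],
    task: (input: string) => "Hi " + input,
    scores: [Levenshtein],
  },
  "My reporter", // Matches the name passed to Reporter()
);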

Include attachments

Braintrust allows you to log binary data like images, audio, and PDFs as attachments. Use attachments in evaluations by initializing an Attachment object in your data:
import { Eval, Attachment } from "braintrust";
import { NumericDiff } from "autoevals";
import path from "path";

function loadPdfs() {
  return ["example.pdf"].map((pdf) => ({
    input: {
      file: new Attachment({
        filename: pdf,
        contentType: "application/pdf",
        data: path.join("files", pdf),
      }),
    },
    // This is a toy example where we check that the file size is what we expect.
    expected: 469513,
  }));
}

async function getFileSize(input: { file: Attachment }) {
  return (await input.file.data()).size;
}

Eval("Project with PDFs", {
  data: loadPdfs,
  task: getFileSize,
  scores: [NumericDiff],
});
You can also store attachments in a dataset for reuse across multiple experiments. After creating the dataset, reference it by name in an eval. The attachment data is automatically downloaded from Braintrust when accessed:
import { NumericDiff } from "autoevals";
import { initDataset, Eval, ReadonlyAttachment } from "braintrust";

async function getFileSize(input: {
  file: ReadonlyAttachment;
}): Promise<number> {
  return (await input.file.data()).size;
}

Eval("Project with PDFs", {
  data: initDataset({
    project: "Project with PDFs",
    dataset: "My PDF Dataset",
  }),
  task: getFileSize,
  scores: [NumericDiff],
});
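To populate such a dataset in the first place, you can insert rows that contain Attachment objects (a minimal sketch; the project, dataset, and file names are placeholders):
import { initDataset, Attachment } from "braintrust";
import path from "path";

async function main() {
  const dataset = initDataset({
    project: "Project with PDFs",
    dataset: "My PDF Dataset",
  });
  dataset.insert({
    input: {
      file: new Attachment({
        filename: "example.pdf",
        contentType: "application/pdf",
        data: path.join("files", "example.pdf"),
      }),
    },
    // Matches the toy file-size check used in the example above
    expected: 469513,
  });
  // Flush pending rows before the process exits
  await dataset.flush();
}

main();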

Use attachment URLs

Obtain a signed URL for the attachment to forward to other services like OpenAI:
import { initDataset, wrapOpenAI, ReadonlyAttachment } from "braintrust";
import { OpenAI } from "openai";

const client = wrapOpenAI(
  new OpenAI({
    apiKey: process.env.OPENAI_API_KEY,
  }),
);

async function main() {
  const dataset = initDataset({
    project: "Project with images",
    dataset: "My Image Dataset",
  });
  for await (const row of dataset) {
    const attachment: ReadonlyAttachment = row.input.file;
    const attachmentUrl = (await attachment.metadata()).downloadUrl;
    const response = await client.chat.completions.create({
      model: "gpt-4o",
      messages: [
        {
          role: "system",
          content: "You are a helpful assistant",
        },
        {
          role: "user",
          content: [
            { type: "text", text: "Please summarize the attached image" },
            { type: "image_url", image_url: { url: attachmentUrl } },
          ],
        },
      ],
    });
    const summary = response.choices[0].message.content || "Unknown";
    console.log(
      `Summary for file ${attachment.reference.filename}: ${summary}`,
    );
  }
}

main();

Trace your evals

Add detailed tracing to your evaluation task functions to measure performance and debug issues. Each span in the trace represents an operation like an LLM call, database lookup, or API request.
Use wrapOpenAI/wrap_openai to automatically trace OpenAI API calls. See Add custom tracing for details.
Each call to experiment.log() creates its own trace. Do not mix experiment.log() with tracing functions like traced(), as this creates incorrectly parented traces.
Wrap task code with traced() to log incrementally to spans. This example progressively logs input, output, and metrics:
import { Eval, traced } from "braintrust";

async function callModel(input: string) {
  return traced(
    async (span) => {
      const messages = { messages: [{ role: "system", content: input }] };
      span.log({ input: messages });

      // Replace this with a model call
      const result = {
        content: "China",
        latency: 1,
        prompt_tokens: 10,
        completion_tokens: 2,
      };

      span.log({
        output: result.content,
        metrics: {
          latency: result.latency,
          prompt_tokens: result.prompt_tokens,
          completion_tokens: result.completion_tokens,
        },
      });
      return result.content;
    },
    {
      name: "My AI model",
    },
  );
}

const exactMatch = (args: {
  input: string;
  output: string;
  expected?: string;
}) => {
  return {
    name: "Exact match",
    score: args.output === args.expected ? 1 : 0,
  };
};

Eval("My Evaluation", {
  data: () => [
    { input: "Which country has the highest population?", expected: "China" },
  ],
  task: async (input, { span }) => {
    return await callModel(input);
  },
  scores: [exactMatch],
});
This creates a span tree you can visualize in the UI by clicking on each test case in the experiment.

Next steps