When crafting AI features, the right model and prompt choice can make all the difference. In this post, I'll walk you through a workflow that lets you confidently test combinations at scale—and why Braintrust is the most developer‑friendly way to do it.
Every AI developer I've worked with eventually asks the same thing: "Which model should I use—and with what prompt?"
When you're dealing with real user queries or production traffic, your instincts only take you so far. You need evidence.
Braintrust is built for that. It turns model and prompt testing from guesswork into rigorous, measurable experiments.
You don't have to pick blindly among accuracy, cost, and speed; Braintrust lets you compare all three across every model/prompt combination.
My developer-friendly approach is to treat the problem as a matrix of tests:
| Model | Prompt A | Prompt B | Prompt C |
|---|---|---|---|
| GPT‑4o | Score | Score | Score |
| GPT‑3.5‑turbo | Score | Score | Score |
| Claude 3 Haiku | Score | Score | Score |
Each cell represents running a dataset through a model with a specific prompt, scoring the outputs, and measuring cost and latency. Braintrust covers all of these pieces in one platform.
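To make that grid concrete, here's a minimal sketch of driving it from the Python SDK. It assumes the `Eval` entry point from `braintrust` and the `Levenshtein` scorer from `autoevals`; the project name, prompts, dataset row, and the `experiment_name` labeling are my own placeholders, and the task body is a stub you'd replace with a real model call (see the proxy example later in the post).

```python
from braintrust import Eval
from autoevals import Levenshtein

# Illustrative model list and prompt templates: one experiment per cell of the matrix.
MODELS = ["gpt-4o", "gpt-3.5-turbo", "claude-3-haiku-20240307"]
PROMPTS = {
    "prompt-a": "Answer concisely: {input}",
    "prompt-b": "You are a support agent. Answer using the product docs: {input}",
}

def make_task(model, template):
    def task(input):
        # Placeholder: swap in a real model call here (e.g. via the proxy shown below).
        return f"[{model}] " + template.format(input=input)
    return task

for model in MODELS:
    for prompt_name, template in PROMPTS.items():
        Eval(
            "support-bot",  # project name (placeholder)
            experiment_name=f"{model}--{prompt_name}",  # one experiment per matrix cell
            data=lambda: [
                {"input": "How do I reset my password?",
                 "expected": "Go to Settings > Security > Reset password."}
            ],
            task=make_task(model, template),
            scores=[Levenshtein],  # add custom scorers here too
        )
```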
Grab a representative set of inputs, such as real customer queries, support tickets, or documents to summarize. In Braintrust, you store these as datasets, which keeps them versioned and shareable across projects.
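As a sketch (assuming the SDK's `init_dataset` / `insert` helpers; the project name and record below are placeholders), seeding a dataset looks roughly like this:

```python
import braintrust

# Create or connect to a versioned dataset inside a project.
dataset = braintrust.init_dataset(project="support-bot", name="customer-queries")

# Each record pairs an input with the expected (reference) output.
dataset.insert(
    input="How do I reset my password?",
    expected="Go to Settings > Security > Reset password.",
    metadata={"source": "support-tickets"},
)

dataset.flush()  # ensure records are uploaded before the script exits
```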
Braintrust treats prompts as versioned, first-class objects. You can author, update, and track them alongside code.
"You are a support agent. Respond clearly and accurately using product documentation.
Question: {{input}}"
You can track changes, pin a prompt to a specific version ID, and see how revisions affect results.
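Here's a hedged sketch of pulling a pinned prompt into code, assuming the SDK's `load_prompt` helper and an OpenAI-compatible client; the slug and version ID are placeholders:

```python
import braintrust
from openai import OpenAI

# Load a specific version of a stored prompt (drop `version` to get the latest).
prompt = braintrust.load_prompt(
    project="support-bot",
    slug="support-agent-prompt",
    version="abc123",  # placeholder version ID
)

# build() fills in the {{input}} variable and returns chat-completion arguments.
client = OpenAI()
response = client.chat.completions.create(**prompt.build(input="How do I reset my password?"))
print(response.choices[0].message.content)
```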
Braintrust's unified LLM proxy bridges your code and providers like OpenAI, Anthropic, Mistral, AWS Bedrock, and Vertex AI, so you can swap models without rewriting your integration.
```python
import braintrust as bt

result = bt.llm.complete(
    model="gpt-4o",
    prompt="Summarize: {{input}}",
    variables={"input": "Text here…"},
)
```
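In practice I point a standard OpenAI client at the proxy; a sketch, assuming the endpoint URL below (check the docs for the current one) and a `BRAINTRUST_API_KEY` in the environment:

```python
import os
from openai import OpenAI

# One client, many providers: the proxy routes requests by model name.
client = OpenAI(
    base_url="https://api.braintrust.dev/v1/proxy",  # assumed proxy endpoint
    api_key=os.environ["BRAINTRUST_API_KEY"],
)

for model in ["gpt-4o", "claude-3-haiku-20240307"]:
    response = client.chat.completions.create(
        model=model,  # same call, different provider
        messages=[{"role": "user", "content": "Summarize: Text here..."}],
    )
    print(model, response.choices[0].message.content)
```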
Use autoevals or custom scoring functions to measure performance:
```python
import braintrust as bt

def contains_keywords(output, expected_keywords):
    # Simple pass/fail check: does the output mention every expected keyword?
    return all(k in output for k in expected_keywords)

bt.register_scorer("contains_keywords", contains_keywords)
```
This gives you pass/fail metrics instead of eyeballing outputs.
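For off-the-shelf checks, `autoevals` ships ready-made scorers you can call directly; a small sketch using `Levenshtein` as one example (other scorers follow the same callable pattern):

```python
from autoevals import Levenshtein

# Compare an output to its reference string; scores fall between 0 and 1.
result = Levenshtein()(
    output="Reset it under Settings > Security.",
    expected="Go to Settings > Security > Reset password.",
)
print(result.score)
```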
Braintrust's experiments UI lets you drill into individual examples, understand regressions, and fine-tune your configurations.
The findings pointed to a clear outcome: use Claude 3 Haiku in production, with GPT‑4o as a fallback, all orchestrated and benchmarked in Braintrust.
Model and prompt testing doesn't have to be fragmented. Braintrust brings together versioning, evaluation, observability, and provider flexibility in a developer-first platform.
If you're ready to test with confidence and choose models and prompts based on data, start experimenting with Braintrust today. Your AI workflows will become clearer, faster, and more resilient.