Eval Engineer
Role Overview
This mid-to-senior Eval Engineer role involves designing and implementing experiments to evaluate AI models, agents, and frameworks, including building the datasets, scoring logic, and evaluation harnesses behind them. The engineer analyzes results and publishes findings for the developer ecosystem, working at the intersection of engineering and technical storytelling to establish better evaluation patterns. The work improves how AI systems are measured and gives the industry reproducible benchmarks.
Perks & Benefits
The role is fully remote with flexible time off, and offers medical, dental, and vision insurance plus daily lunch, snacks, and beverages. Compensation includes a competitive salary, equity, and an AI stipend, within a culture that values curiosity, creativity, and rapid iteration, with opportunities for career growth through publishing and open-source contributions.
Full Job Description
About the company
Braintrust is the AI observability platform. By connecting evals and observability in one workflow, Braintrust gives builders the visibility to understand how AI behaves in production and the tools to improve it.
Teams at Notion, Stripe, Zapier, Vercel, and Ramp use Braintrust to compare models, test prompts, and catch regressions — turning production data into better AI with every release.
About the role
We’re hiring an Eval Engineer to design and run creative evaluations of new AI capabilities. Your job is to turn emerging AI ideas into measurable experiments and publish the results for the developer ecosystem.
When new models, agents, or frameworks appear, everyone has opinions about what works but few people actually test them. This role exists to change that.
You’ll design experiments that compare models, prompts, and agent architectures against real tasks. You’ll build the datasets, scoring logic, and evaluation harnesses. Then you’ll publish the results so builders understand what actually works.
This role sits at the intersection of engineering, experimentation, and technical storytelling.
What you’ll own
Industry evals
Design and run evaluations of new AI capabilities
Compare frontier models, agent systems, and tool workflows
Turn emerging ideas into measurable benchmarks
Eval design
Define datasets, tasks, and scoring logic for experiments
Design realistic workloads that reflect production environments
Create tests that expose failure modes and edge cases
Experiment implementation
Build evaluation harnesses using Braintrust (a minimal sketch follows this list)
Run comparisons across models, prompts, and agent approaches
Analyze traces, outputs, and failure patterns
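To ground what such a harness looks like, here is a minimal sketch assuming the Braintrust Python SDK's Eval entrypoint and a scorer from the companion autoevals package; the experiment name, dataset, and task below are hypothetical placeholders, not a prescribed implementation:

    # Minimal eval harness: a tiny dataset, a task under test, and one scorer.
    from braintrust import Eval
    from autoevals import Levenshtein  # string-similarity scorer

    Eval(
        "greeting-eval",  # hypothetical experiment name
        data=lambda: [
            {"input": "Alice", "expected": "Hi Alice"},
            {"input": "Bob", "expected": "Hi Bob"},
        ],
        task=lambda name: "Hi " + name,  # stand-in for a model or agent call
        scores=[Levenshtein],  # scorers: autoevals classes or plain 0-1 functions
    )

Run via the Braintrust CLI or directly as a script (with an API key configured), a harness in this shape records an experiment whose traces and scores can be compared across models, prompts, and agent approaches.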
Creative test construction
Invent novel ways to stress test AI systems
Design scenarios that break agents, prompts, and model reasoning
Build adversarial or complex datasets that reveal weaknesses
Technical content
Write technical posts explaining evaluation methodology and results
Share datasets and scoring logic so experiments are reproducible
Help establish better evaluation patterns for the industry via courses
Evaluation playbooks
Develop reusable eval patterns for agents, RAG systems, and LLM apps
Create open source reference implementations developers can adopt
Contribute examples and guides that help teams build better evals
What great looks like
You’re an engineer who likes testing systems more than building features
You enjoy breaking things and understanding why they fail
You can design experiments that isolate meaningful differences between approaches
You understand how LLMs, agents, and RAG systems actually work
You write clearly for technical audiences
You ship experiments quickly and iterate often
You care about methodology and reproducibility
You’re curious, creative, and opinionated about how AI should be evaluated
What you’ve done
Built or contributed to evaluation systems for LLM or agent applications
Designed experiments comparing models, prompts, or AI architectures
Written Python code to run tests across models or APIs
Built datasets or scoring logic for AI quality measurement
Investigated model failures or unexpected behaviors
Published technical blog posts, research notes, or engineering write-ups
Built prototypes quickly to test ideas
If you want to help the industry understand how to measure AI systems and design the evaluations everyone else learns from, this is the role.
Benefits include
Medical, dental, and vision insurance
Daily lunch, snacks, and beverages
Flexible time off
Competitive salary and equity
AI Stipend
Equal opportunity
Braintrust is an equal opportunity employer. All applicants will be considered for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, veteran or disability status.