Evals
Remote / San Francisco, CA
Competitive + equity

Software Engineer, Evals

There is no public benchmark for a customer's definition of quality. Evals turns accepted work, corrections, and failures into rubrics, golden sets, and routing decisions — the loop that makes every completed task improve the next one. You will build that loop.

What you will do

  • Build the rubric engine that lets domain experts express what good looks like in terms they recognize, then scores every run against it.

  • Turn accepted work into golden sets automatically, and corrections into labeled failure modes.

  • Drive routing with eval results: which model, which context, which level of human review each task deserves.

  • Design the dashboards and review surfaces where customer experts judge, correct, and approve agent work.

You will thrive in this role if you

  • Have worked on ML evaluation, data quality, or human-in-the-loop systems and know how easily metrics drift from meaning.

  • Are rigorous about statistics but pragmatic about products — an eval no expert will use measures nothing.

  • Want your work to be the company's answer to the question every enterprise buyer asks: how do you know it's right?

Apply now

Email us your resume and a short note about the most impressive thing you have built or shipped. We read every application and reply to all of them.