There is no public benchmark for a customer's definition of quality. Evals turns accepted work, corrections, and failures into rubrics, golden sets, and routing decisions — the loop that makes every completed task improve the next one. You will build that loop.
What you will do
Build the rubric engine that lets domain experts express what good looks like in terms they recognize, then scores every run against it.
Turn accepted work into golden sets automatically, and corrections into labeled failure modes.
Drive routing with eval results: which model, which context, which level of human review each task deserves.
Design the dashboards and review surfaces where customer experts judge, correct, and approve agent work.
You will thrive in this role if you
Have worked on ML evaluation, data quality, or human-in-the-loop systems and know how easily metrics drift from meaning.
Are rigorous about statistics but pragmatic about products — an eval no expert will use measures nothing.
Want your work to be the company's answer to the question every enterprise buyer asks: how do you know it's right?