6 Jun 2026

The reward desert

The Context team

There is a standard recipe now for making a model better at a task: collect labeled examples, build an environment, and fine-tune until a reward goes up. It is the recipe behind a lot of the recent progress, and it works beautifully when the task has a clean, checkable answer: a game you win or lose, a math problem with a solution, code that passes or fails its tests.

Most enterprise knowledge work is not like that. It lives in what we call a reward desert: a place where the signal of what good was is messy, shifting, and almost impossible to recover after the fact.

Where reinforcement learning works, and where it stops

Reinforcement learning needs a reward, and a reward needs to be verifiable. In a game the verifier is the score. In code it is the test suite. In math it is the proof checker. Give a learning loop a crisp, cheap, automatic check of success and it will climb toward it. That is the terrain RL was built for, and on that terrain it is extraordinary.

Enterprise knowledge work rarely offers such a check. Was this diligence memo good? There is no test suite for that, no proof checker, and the honest answer depends on context that was true at the moment of the work and is gone by the time anyone scores it. The desert is not that the work is unmeasurable. It is that the measurement is not available where the learning loop wants it: cheap, automatic, and after the fact.

Four ways to manufacture a signal, and what each misses

So people try to manufacture the reward the desert does not provide. There are four common attempts, and each misses something specific.

The first is post-hoc expert grading: have experts score outputs after the fact. The trouble is that an expert grading a finished memo cannot reconstruct decision-time state, what was visible, what constraints applied, which precedent mattered. They grade the artifact, not the decision, and those are different things. The grade is not isomorphic to the work.

The second is domain RL environments: build a training environment for a category of work. This can teach general task competence, how due diligence works in the abstract, but not institutional task competence, how due diligence works at your company, which is precisely the part that fails in deployment.

The third is synthetic simulation: use capture tools to fabricate executions to train on. Simulation covers the common path well and misses exactly the long-tail, exception-heavy complexity that dominates real production work. You learn the easy eighty percent and stall on the twenty that was the whole point of automating the work.

The fourth is process and log mining: mine the records of what happened. Logs capture what was done, not what should have been done. They miss the why, the corrections, the reasoning, and the path not taken, which together are most of the signal worth having.

The signal exists, but only at decision time

Notice what all four are working around. A real reward signal does exist, but only at one place and time: the moment the work is actually done, in the head of the person doing it. What good looks like, why this exception applies, which constraint bound the decision, the correction a reviewer makes, all of it is present then and decays almost immediately after.

Make it concrete. A partner approved an exception in a hallway conversation. An analyst weighed a risk the model never saw because it lived in a side thread. The comparable set was chosen because of something a client said last week. A grader handed the finished memo a month later sees none of that. They can tell you whether the memo reads well. They cannot tell you whether it was the right memo, because the facts that made it right are no longer in the room.

Every one of the four approaches is an attempt to recover that signal later, from outside, and each loses the part that made it valuable. The only way to get a clean reward signal for this kind of work is to be present at decision time and capture it as a byproduct of the work itself. You cannot mine the desert for water that evaporated before you arrived.

Capture beats reconstruction

This is the whole difference. A separate labeling pipeline, however well run, is reconstruction: it asks an expert to re-derive, from a finished output, a judgment that depended on context the output no longer carries. Capturing the signal where the work happens is the opposite. When a person accepts an output, edits it, or rejects it during the actual task, the judgment arrives with its full context attached, because the context is still there.

Same expert, same judgment, but one is made in the desert and one is made at the well. The data is not better because the people are better. It is better because of where and when it was collected. Decision-time capture is not a nicer way to label; it is the only way to get the label that actually describes the decision.

Why this is bigger than data quality

It is tempting to read this as a data-quality argument, and it is one, but it is also a structural one. The companies whose business is the reward signal, the standalone labeling platforms, are on the wrong side of it by construction: they collect judgments outside the work, because they are not the surface where the work happens.

You cannot fix decontextualized annotation with a better annotation tool, because the missing thing is the context, and the context is not in the tool. The reward signal for long-horizon enterprise work belongs to whoever is present when the work is done. That is a different position than selling labels, and it is not one a labeling vendor can reach by improving the labels.

It is why we capture the reward signal inside the work surface itself, as people accept, edit, and reject real outputs, rather than in a separate tool downstream. Not because it is more convenient, though it is, but because it is the only place the signal is still whole.

Be there when the decision is made

The label-and-fine-tune recipe is not wrong. It is out of its range. It needs a verifiable reward, and most enterprise work does not hand you one. The four ways to manufacture one all reach back into the desert and come out with less than they went looking for. The signal is real, but it lives at decision time, and the only way to keep it is to be there when the decision is made. In a reward desert, capture beats reconstruction, and that is the entire game.