Isolating Tool Output in Agents
If indirect prompt injection is the disease, an autonomous agent is the patient with no immune system. Agents read web pages, query databases, call APIs, and feed the results straight back into their own context, then act on what they read. Every one of those results is attacker-influenceable text.
The unifying principle: the model is allowed to be wrong. Assume that at some point it will be talked into doing the wrong thing, and build so that when that happens, the blast radius is small. Security lives in the code and permissions around the model, not in the prompt.
1. Treat all tool and retrieval output as untrusted
Output from a tool is not "the answer." It is untrusted input that happens to be machine-generated. A web page, a database row, and a file's contents can all carry instructions. Label that content as untrusted the moment it enters your pipeline and carry that label everywhere it goes.
2. Least-privilege tools
An agent should hold the narrowest set of capabilities that lets it do its job. A research assistant that summarizes pages does not need a "send email" tool or shell access. If a capability is not present, no injection can invoke it.
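One way to make this concrete is a registry that hands each agent role only the tools it was granted. The sketch below is illustrative: `fetch_page`, `send_email`, `ALL_TOOLS`, and `ToolRegistry` are hypothetical names, and the tool bodies are stubs standing in for real I/O.

```python
from typing import Callable, Dict, FrozenSet

# Hypothetical tool implementations (stubs); real ones would do I/O.
def fetch_page(url: str) -> str:
    return f"<contents of {url}>"

def send_email(to: str, body: str) -> str:
    return f"sent to {to}"

ALL_TOOLS: Dict[str, Callable[..., str]] = {
    "fetch_page": fetch_page,
    "send_email": send_email,
}

class ToolRegistry:
    """Exposes only the tools granted to one agent role."""

    def __init__(self, granted: FrozenSet[str]):
        unknown = granted - set(ALL_TOOLS)
        if unknown:
            raise ValueError(f"unknown tools: {sorted(unknown)}")
        # Ungranted tools are never copied in, so there is nothing to invoke.
        self._tools = {name: ALL_TOOLS[name] for name in granted}

    def call(self, name: str, *args, **kwargs) -> str:
        if name not in self._tools:
            raise PermissionError(f"tool {name!r} is not granted to this agent")
        return self._tools[name](*args, **kwargs)

# A summarizing research assistant gets read access and nothing else.
research_agent_tools = ToolRegistry(frozenset({"fetch_page"}))
```

If an injected instruction convinces the model to request `send_email`, the call fails in ordinary code, regardless of what the model was told.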
3. Human review for high-impact actions
Reading is reversible; acting often is not. Gate irreversible or high-impact operations, such as sending messages, spending money, deleting data, or changing permissions, behind explicit human confirmation.
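A confirmation gate can be as small as a wrapper that consults a human before running anything on a high-impact list. This is a minimal sketch under assumptions: the action names in `HIGH_IMPACT`, the `gated_call` helper, and the console prompt are all hypothetical, and a real system would show the reviewer the full arguments of the action.

```python
from typing import Callable

# Hypothetical action names; use whatever your agent's tools are called.
HIGH_IMPACT = frozenset(
    {"send_message", "spend_money", "delete_data", "change_permissions"}
)

def gated_call(
    action: str,
    fn: Callable[[], str],
    confirm: Callable[[str], bool],
) -> str:
    """Run fn, asking a human first when the action is high-impact."""
    if action in HIGH_IMPACT:
        approved = confirm(f"Agent wants to run {action!r}. Allow?")
        if not approved:
            raise PermissionError(f"{action!r} was not approved by a human")
    return fn()

def console_confirm(prompt: str) -> bool:
    """Simplest possible reviewer: anything but 'y' is a refusal."""
    return input(f"{prompt} [y/N] ").strip().lower() == "y"
```

The decision is made by `confirm`, which the model cannot call or influence; low-impact actions pass through without a prompt, so review stays focused on the cases that matter.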
4. Output allow-lists, not deny-lists
When a tool's result feeds a decision, validate it against what you expect rather than trying to filter out what is bad. If a tool is supposed to return a country code, accept ^[A-Z]{2}$ and reject everything else. Deny-lists lose to creative encodings; allow-lists fail closed.
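For the country-code case, a sketch of the check might look like this. The function name `parse_country_code` is hypothetical; the pattern is the one from the text, applied with `re.fullmatch` so that both ends are anchored.

```python
import re

COUNTRY_CODE = re.compile(r"[A-Z]{2}")

def parse_country_code(raw: str) -> str:
    """Return raw if it is exactly two uppercase ASCII letters, else raise."""
    # fullmatch must consume the whole string. With match() and a trailing
    # `$`, Python would also accept "US\n", because `$` matches just before
    # a final newline.
    if COUNTRY_CODE.fullmatch(raw) is None:
        raise ValueError(f"unexpected tool output: {raw!r}")
    return raw
```

Anything that is not a two-letter code, including a valid code followed by injected text, raises instead of flowing onward.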
5. The dual / quarantined-LLM pattern
A powerful pattern is to split responsibilities between two models:
- A quarantined LLM is the only thing allowed to touch untrusted content. It extracts structured, schema-validated data. It has no tools and no authority.
- A privileged LLM orchestrates and calls tools, but only ever sees trusted, validated fields from the quarantine step, never the raw text.
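The untrusted text never reaches the model that can act. Even if the quarantined model is hijacked, the worst it can do is return bad data: malformed output is rejected by your schema, and well-formed but false values are confined to the fields and ranges the schema allows. That is why the schema should be as tight as the task permits.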
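The flow can be sketched as follows, with the two models passed in as plain callables. Everything here is an assumption made for illustration: the invoice use case, the `Invoice` fields, the vendor allow-list, the amount cap, and the names `parse_invoice` and `process_document`.

```python
import json
from dataclasses import dataclass
from typing import Callable, FrozenSet

@dataclass(frozen=True)
class Invoice:
    """The only fields the privileged side ever sees."""
    vendor_id: str
    amount_cents: int

def parse_invoice(raw_json: str, known_vendors: FrozenSet[str]) -> Invoice:
    """Validate the quarantined model's output against a strict schema."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        raise ValueError("quarantined output is not JSON") from exc
    # Exactly the expected keys: extra fields are rejected, not ignored.
    if not isinstance(data, dict) or set(data) != {"vendor_id", "amount_cents"}:
        raise ValueError("unexpected fields in quarantined output")
    vendor_id, amount = data["vendor_id"], data["amount_cents"]
    if not isinstance(vendor_id, str) or vendor_id not in known_vendors:
        raise ValueError("unknown vendor")
    # bool is a subclass of int, so compare the exact type.
    if type(amount) is not int or not 0 < amount <= 1_000_000:
        raise ValueError("amount out of range")
    return Invoice(vendor_id, amount)

def process_document(
    untrusted_text: str,
    quarantined_llm: Callable[[str], str],
    privileged_llm: Callable[[Invoice], str],
    known_vendors: FrozenSet[str],
) -> str:
    raw = quarantined_llm(untrusted_text)        # no tools, no authority
    invoice = parse_invoice(raw, known_vendors)  # fails closed on any mismatch
    return privileged_llm(invoice)               # never sees untrusted_text
```

The vendor allow-list and the amount cap bound what even a fully hijacked extractor can assert; free-text fields would reopen the channel, so the sketch has none.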
6. Provenance tagging
Track where every piece of context came from, and let trust level travel with the data. A defensive tool wrapper can attach provenance at the boundary so downstream code can make policy decisions:
from dataclasses import dataclass
from typing import Callable, Literal

Trust = Literal["trusted", "untrusted"]

@dataclass(frozen=True)
class Tagged:
    """Tool output carrying its provenance. Untrusted by default."""
    value: str
    source: str
    trust: Trust = "untrusted"

def wrap_tool(name: str, fn: Callable[..., str], trust: Trust = "untrusted"):
    def tool(*args, **kwargs) -> Tagged:
        raw = fn(*args, **kwargs)
        return Tagged(value=raw, source=name, trust=trust)
    return tool

def require_trusted(t: Tagged) -> str:
    if t.trust != "trusted":
        raise PermissionError(
            f"refusing to act on untrusted content from {t.source!r}"
        )
    return t.value
The exact code is not the point. Trust is a property you attach and check in your own code, not a vibe you hope the model maintains.
Putting it together
These patterns compose. Least privilege limits what injection can reach; provenance tagging tells you what is safe to act on; the quarantine pattern keeps hostile text away from authority; human review catches the high-stakes cases; allow-lists fail closed.
Notice what is not on this list: "write a better system prompt telling the model to ignore injections." Prompt wording is worth doing, but it is a speed bump, not a wall. Put your real boundaries in code.