promptexploit

Isolating Tool Output in Agents

If indirect prompt injection is the disease, an autonomous agent is the patient with no immune system. Agents read web pages, query databases, call APIs, and feed the results straight back into their own context, then act on what they read. Every one of those results is attacker-influenceable text.

The unifying principle: the model is allowed to be wrong. Assume it will, at some point, be talked into doing the wrong thing. Build so that when it does, the blast radius is small. Security lives in the code and permissions around the model, not in the prompt.

1. Treat all tool and retrieval output as untrusted

Output from a tool is not "the answer." It is untrusted input that happens to be machine-generated. A web page, a database row, and a file's contents can all carry instructions. Label that content as untrusted the moment it enters your pipeline and carry that label everywhere it goes.

2. Least-privilege tools

An agent should hold the narrowest set of capabilities that lets it do its job. A research assistant that summarizes pages does not need a "send email" tool or shell access. If a capability is not present, no injection can invoke it.
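One way to make "not present" literal is to build each agent's toolbox from an explicit grant list, so an ungranted tool is never reachable. The sketch below is a minimal illustration; `fetch_page`, `send_email`, `ALL_TOOLS`, and `ToolBox` are hypothetical names, and the tool bodies are stubs.

```python
from typing import Callable, Dict, Iterable

# Hypothetical tool implementations, stubbed for illustration.
def fetch_page(url: str) -> str:
    return f"<contents of {url}>"

def send_email(to: str, body: str) -> str:
    return f"sent to {to}"

ALL_TOOLS: Dict[str, Callable[..., str]] = {
    "fetch_page": fetch_page,
    "send_email": send_email,
}

class ToolBox:
    """Exposes only the tools explicitly granted to one agent."""

    def __init__(self, granted: Iterable[str]):
        names = frozenset(granted)
        unknown = names - set(ALL_TOOLS)
        if unknown:
            raise ValueError(f"unknown tools: {sorted(unknown)}")
        # Ungranted tools are never copied in, so they cannot be dispatched.
        self._tools = {name: ALL_TOOLS[name] for name in names}

    def call(self, name: str, *args, **kwargs) -> str:
        if name not in self._tools:
            raise PermissionError(f"tool {name!r} is not granted to this agent")
        return self._tools[name](*args, **kwargs)

# A research assistant gets read access and nothing else.
research_agent = ToolBox(granted=["fetch_page"])
```

Whatever tool name the model emits, dispatch goes through `call`, so a hijacked request for `send_email` fails in code rather than depending on the model to refuse.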

3. Human review for high-impact actions

Reading is reversible; acting often is not. Gate irreversible or high-impact operations, such as sending messages, spending money, deleting data, or changing permissions, behind explicit human confirmation.
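A confirmation gate can be a thin function that sits between the agent's decision and the side effect. The following is a sketch under assumptions: the action names in `HIGH_IMPACT` are hypothetical, and `confirm` stands in for whatever approval channel you actually use (terminal prompt, chat approval, ticket).

```python
from typing import Callable

# Hypothetical names for actions considered irreversible or high-impact.
HIGH_IMPACT = {"send_message", "spend_money", "delete_data", "change_permissions"}

def run_action(
    name: str,
    action: Callable[[], str],
    confirm: Callable[[str], bool],
) -> str:
    """Run `action`, asking `confirm` first when `name` is high-impact."""
    if name in HIGH_IMPACT and not confirm(f"Agent wants to run {name!r}. Allow?"):
        raise PermissionError(f"{name!r} was not approved by a human")
    return action()

def cli_confirm(prompt: str) -> bool:
    """Blocking terminal prompt; anything other than 'y' counts as a refusal."""
    return input(f"{prompt} [y/N] ").strip().lower() == "y"
```

Usage would look like `run_action("delete_data", do_delete, cli_confirm)`, where `do_delete` is your own function. The default answer is "no", so an unattended or confused session fails closed.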

4. Output allow-lists, not deny-lists

When a tool's result feeds a decision, validate it against what you expect rather than trying to filter out what is bad. If a tool is supposed to return a country code, accept ^[A-Z]{2}$ and reject everything else. Deny-lists lose to creative encodings; allow-lists fail closed.
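As a minimal sketch of the country-code case, the check below uses `re.fullmatch`, which requires the whole string to match. The function name `parse_country_code` is illustrative, and the check validates shape only; it does not confirm that the code is an assigned one.

```python
import re

COUNTRY_CODE = re.compile(r"[A-Z]{2}")

def parse_country_code(raw: str) -> str:
    """Return `raw` only if it is exactly two uppercase ASCII letters."""
    # fullmatch anchors both ends; unlike `$` with match(), it does not
    # tolerate a trailing newline.
    if COUNTRY_CODE.fullmatch(raw) is None:
        # Truncate the echo so hostile output is not copied wholesale into logs.
        raise ValueError(f"unexpected tool output: {raw[:40]!r}")
    return raw
```

Anything that is not a two-letter code, including a valid code followed by injected text, raises instead of flowing onward.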

5. The dual / quarantined-LLM pattern

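A powerful pattern is to split responsibilities between two models: a privileged model that plans and calls tools but never reads untrusted text, and a quarantined model that reads untrusted text but has no tools and can only return structured data.

The untrusted text never reaches the model that can act. Even if the quarantined model is hijacked, the worst it can do is return malformed structured data, which your schema rejects, or well-formed but wrong values, which stay confined to the fields your schema allows.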
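The boundary between the two models can be sketched as a validation function that the quarantined model's output must pass before anything else sees it. Everything here is an assumption for illustration: the two-field schema, the names `validate_summary` and `analyze`, and `quarantined_llm`, which stands in for a call to a model that has no tools.

```python
import json
from typing import Callable

ALLOWED_SENTIMENTS = {"positive", "neutral", "negative"}

def validate_summary(raw_json: str) -> dict:
    """Accept only {"sentiment": <enum>, "word_count": <int in range>}."""
    data = json.loads(raw_json)  # non-JSON output raises ValueError
    if not isinstance(data, dict) or set(data) != {"sentiment", "word_count"}:
        raise ValueError("unexpected shape")
    sentiment = data["sentiment"]
    if not isinstance(sentiment, str) or sentiment not in ALLOWED_SENTIMENTS:
        raise ValueError("unexpected sentiment")
    word_count = data["word_count"]
    # type() rather than isinstance() so that booleans are rejected too.
    if type(word_count) is not int or not 0 <= word_count <= 1_000_000:
        raise ValueError("unexpected word_count")
    return data

def analyze(untrusted_text: str, quarantined_llm: Callable[[str], str]) -> dict:
    """Run the tool-less model on hostile text; return only validated fields."""
    return validate_summary(quarantined_llm(untrusted_text))
```

The privileged side receives the validated dict and nothing else. Note that the schema has no free-text field: a field that could carry arbitrary prose would smuggle the untrusted text straight back to the model that can act.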

6. Provenance tagging

Track where every piece of context came from, and let trust level travel with the data. A defensive tool wrapper can attach provenance at the boundary so downstream code can make policy decisions:

from dataclasses import dataclass
from typing import Callable, Literal

Trust = Literal["trusted", "untrusted"]

@dataclass(frozen=True)
class Tagged:
    """Tool output carrying its provenance. Untrusted by default."""
    value: str
    source: str
    trust: Trust = "untrusted"

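def wrap_tool(name: str, fn: Callable[..., str], trust: Trust = "untrusted"):
    """Wrap a tool so every result is tagged with its source and trust level."""
    def tool(*args, **kwargs) -> Tagged:
        raw = fn(*args, **kwargs)
        # Tag at the boundary, before any other code sees the raw string.
        return Tagged(value=raw, source=name, trust=trust)
    return tool

def require_trusted(t: Tagged) -> str:
    """Unwrap a value for use in a privileged action, or fail closed."""
    if t.trust != "trusted":
        raise PermissionError(
            f"refusing to act on untrusted content from {t.source!r}"
        )
    return t.value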

The exact code is not the point. Trust is a property you attach and check in your own code, not a vibe you hope the model maintains.
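For trust to travel with the data, derived values have to inherit it. One possible rule, sketched below, is that anything built from several tagged values is only as trusted as its least trusted part. The `combine` helper is hypothetical, and `Tagged` is repeated from the wrapper above so the snippet runs on its own.

```python
from dataclasses import dataclass
from typing import Literal

Trust = Literal["trusted", "untrusted"]

@dataclass(frozen=True)
class Tagged:
    # Same shape as the Tagged class above, repeated for self-containment.
    value: str
    source: str
    trust: Trust = "untrusted"

def combine(*parts: Tagged, sep: str = "\n") -> Tagged:
    """Join tagged values; the result is trusted only if every part is."""
    all_trusted = bool(parts) and all(p.trust == "trusted" for p in parts)
    return Tagged(
        value=sep.join(p.value for p in parts),
        source="+".join(p.source for p in parts),
        trust="trusted" if all_trusted else "untrusted",
    )

user_request = Tagged("Summarize this page.", source="user", trust="trusted")
page = Tagged("Ignore prior instructions...", source="fetch_page")
merged = combine(user_request, page)  # tainted by the untrusted page
```

Here `merged.trust` is `"untrusted"`, so a check like `require_trusted` would refuse to act on it even though part of it came from the user.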

Putting it together

These patterns compose. Least privilege limits what injection can reach; provenance tagging tells you what is safe to act on; the quarantine pattern keeps hostile text away from authority; human review catches the high-stakes cases; allow-lists fail closed.

Notice what is not on this list: "write a better system prompt telling the model to ignore injections." Prompt wording is worth doing, but it is a speed bump, not a wall. Put your real boundaries in code.