The system prompt that tells your agent "never refund more than $500" is not a control. It is a suggestion the model will follow most of the time, ignore some of the time, and abandon entirely when an adversarial input finds the right shape. If the only place that rule exists is in prose, you have no enforcement. You have a hope.
That distinction — guidance versus enforcement — is where most agentic architectures go wrong before they ever reach production. Teams building their first agent tend to spend weeks tuning prompts to get reliable behavior, then ship with those prompts as the primary safety mechanism. The prompts work in demos. They work in most real conversations. They fail in the ones that matter, and when they fail, there is no second line of defense because the team confused "the model usually does X" with "the model is prevented from doing Y."
The practical fix is to move every actual control out of the model and into code. The model proposes; the runtime disposes. Here is how to do that in a way you can explain to your auditors and defend in a postmortem.
Start with what the model can see
The strongest control in an agentic system is the one that cannot fail, because the capability simply is not present. If a tool is not in the tool list passed to the model at session start, the model does not know it exists. It cannot call it. No prompt injection can summon it. No clever reasoning chain can stumble into it. This is why tool registration is the first and most important layer.
Design sessions to register the minimum tool set for the task and the user. A customer service agent handling a tier-one inquiry does not need the refund tool loaded at all if refunds require tier-two escalation. A read-only research agent does not need write tools registered. An agent running on behalf of a contractor does not see the tools available to an agent running on behalf of an employee. This is not a convenience decision — it is a security decision that happens at initialization, long before the model sees a single token of user input.
The practical move: build a session scoping layer that takes user identity, agent role, and task context, and returns a tool manifest. Make the default manifest empty. Tools get added explicitly, with justification, by the code that knows what this session is for. If your agent framework makes it easy to register all tools by default and filter later, fight that pattern. Registration should be the gate, not an afterthought.
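A minimal sketch of that layer, in Python. The roles, tasks, and tool names here are invented for illustration; what matters is the shape: an empty default, with every grant explicit.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SessionContext:
    user_id: str
    agent_role: str   # e.g. "support_tier1", "research_readonly" (illustrative)
    task: str         # e.g. "billing_inquiry"

# Explicit allow-list: (role, task) -> tool names. Anything not listed here
# is never registered, so the model cannot call it.
TOOL_GRANTS: dict[tuple[str, str], frozenset[str]] = {
    ("support_tier1", "billing_inquiry"): frozenset({"lookup_order", "lookup_customer"}),
    ("support_tier2", "billing_inquiry"): frozenset({"lookup_order", "lookup_customer", "execute_refund"}),
    ("research_readonly", "web_research"): frozenset({"search_web", "fetch_page"}),
}

def build_tool_manifest(ctx: SessionContext) -> frozenset[str]:
    """Return the minimum tool set for this session. The default is empty:
    an unknown role/task combination gets no tools, not all tools."""
    return TOOL_GRANTS.get((ctx.agent_role, ctx.task), frozenset())

# At session start, only this manifest is passed to the model:
ctx = SessionContext(user_id="u-123", agent_role="support_tier1", task="billing_inquiry")
manifest = build_tool_manifest(ctx)
assert "execute_refund" not in manifest  # tier one never sees the refund tool
```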
Treat the orchestrator as your policy engine
Once a tool is registered, the model can emit a call to it. Before that call reaches the actual system, it passes through the orchestration layer — the agent framework code that sits between the model and your tools. This is where your real policy lives.
The orchestrator inspects every tool call. It sees the tool name, the arguments, the session context, and the conversation state. It decides whether to allow the call, modify it, reject it with a structured error, or pause for human approval. Every check that matters for compliance or risk belongs here, written as code, testable and auditable.
Concretely: the refund cap lives in the orchestrator, not the prompt. The rule that a specific action type requires a valid human-approval token lives in the orchestrator. The logic that says "this tool cannot be called more than five times per session" lives in the orchestrator. The check that confirms the arguments match the current user's authorized scope lives in the orchestrator. When the model emits execute_refund(amount=750) and your policy caps refunds at $500, the orchestrator rejects the call and returns a structured error to the model. The model then reasons about the error — often choosing to request human approval, which is exactly what you want — but the enforcement happened before the refund function was ever invoked.
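Here is a sketch of what those checks can look like as code. The tool name comes from the running example; the cap, the per-session limit, and the state fields are illustrative assumptions, not any particular framework's API.

```python
from dataclasses import dataclass, field

REFUND_CAP = 500              # dollars; the policy cap from the prose
MAX_CALLS_PER_SESSION = 5     # illustrative per-tool limit

@dataclass
class PolicyDecision:
    allowed: bool
    error: str | None = None  # structured error the model can reason about

@dataclass
class SessionState:
    call_counts: dict = field(default_factory=dict)
    valid_approval_tokens: set = field(default_factory=set)  # issued by humans elsewhere

def check_tool_call(tool: str, args: dict, state: SessionState) -> PolicyDecision:
    """Runs on every tool call the model emits, before anything executes."""
    count = state.call_counts.get(tool, 0)
    if count >= MAX_CALLS_PER_SESSION:
        return PolicyDecision(False, f"{tool}: per-session call limit reached")

    if tool == "execute_refund":
        amount = args.get("amount", 0)
        token = args.get("approval_token")
        # The $500 cap lives here, in code. execute_refund(amount=750) without
        # a valid approval token never reaches the payment system.
        if amount > REFUND_CAP and token not in state.valid_approval_tokens:
            return PolicyDecision(False, f"refunds over ${REFUND_CAP} require human approval")

    state.call_counts[tool] = count + 1
    return PolicyDecision(True)
```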
The Monday-morning test: for every behavior you would describe to an auditor as a control, point at a specific piece of orchestrator code. If your answer is "it's in the system prompt," that control does not exist.
Design tools as if the agent is hostile
The tool's own implementation is the next layer of defense, and it should not trust the orchestrator. Write tools the way you would write any public API: validate every argument, verify every permission, log every invocation. Assume the caller is untrusted, because even when the caller is your own agent framework, the model's output shaped those arguments, and model output should be treated as user-supplied data.
When execute_refund is invoked, it independently checks that its service account has refund permissions in the payment system, that the amount is within the service-level policy limit, that the customer exists, that any required approval tokens are valid and unexpired. It does not assume the orchestrator has already checked these things. If the orchestrator is misconfigured, bypassed, or has a bug, the tool still refuses unauthorized actions.
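A sketch of that tool with its dependencies injected. The payments, approvals, and audit objects are hypothetical stand-ins for your real payment client, approval-token service, and audit logger; the point is that every check runs again inside the tool.

```python
REFUND_LIMIT = 500  # the service-level limit, enforced again inside the tool

def execute_refund(customer_id, amount, approval_token, payments, approvals, audit):
    """`payments`, `approvals`, and `audit` are hypothetical stand-ins for real
    clients. Every check runs here even if the orchestrator already ran its own."""
    # Model output shaped these arguments; validate them as untrusted input.
    if not isinstance(amount, (int, float)) or amount <= 0:
        raise ValueError("refund amount must be a positive number")
    # Re-verify this service account's own permissions in the payment system.
    if not payments.service_account_can("refund"):
        raise PermissionError("service account lacks refund permission")
    if not payments.customer_exists(customer_id):
        raise ValueError("unknown customer")
    # Over-limit refunds require a valid, unexpired human-approval token.
    if amount > REFUND_LIMIT and not approvals.is_valid(approval_token):
        raise PermissionError("missing or expired approval token for over-limit refund")
    result = payments.refund(customer_id=customer_id, amount=amount)
    audit.log("execute_refund", customer_id=customer_id, amount=amount)
    return result
```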
This has a second benefit: it lets you reuse tools across agents safely. A tool that enforces its own contract can be registered in multiple agent configurations without each configuration needing to re-implement the same checks. The tool becomes the unit of security reasoning, which is how you want it.
Let your IAM do its job
The backing system the tool calls into has its own identity and access management, and your agent is just another principal inside it. The tool executes with a specific service account, API key, or OAuth token — and that identity has been granted specific permissions in the target system through whatever IAM infrastructure you already run.
This matters because it means your existing security infrastructure does real work here. If the refund tool is somehow invoked inappropriately, but its service account lacks refund permissions in the payment system, the payment system refuses. You get defense-in-depth essentially for free, provided you do not paper over it by giving agent service accounts broad permissions out of convenience.
Scope service account permissions tightly. Use short-lived credentials where possible. Rotate regularly. Audit what each agent identity can do in each backing system, separately from what the agent framework thinks it can do. The two views should match — when they drift, the drift is almost always over-permission, and it is almost always where incidents come from.
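One way to catch that drift is a periodic audit job that diffs the two views. Both lookup functions below are hypothetical stand-ins for your framework configuration and your IAM provider's API.

```python
def audit_permission_drift(agent_id, framework_view, iam_view) -> dict:
    """Diff what the agent framework registers for an agent against what its
    service account can actually do. `framework_view` and `iam_view` are
    hypothetical stand-ins for your framework config and your IAM API."""
    declared = framework_view(agent_id)  # e.g. {"payments:refund", "orders:read"}
    granted = iam_view(agent_id)         # what the backing systems actually allow
    return {
        # Grants no registered tool needs: the usual source of incidents.
        "over_permissioned": granted - declared,
        # Registered tools IAM would refuse anyway: harmless, but still drift.
        "under_permissioned": declared - granted,
    }
```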
Make the invisible behaviors controllable too
Tool access is the most visible enforcement surface, but several behaviors need controls that are not tied to specific tool calls.
Input filtering matters because prompt injection is real. Scan incoming prompts and retrieved context before they reach the model. Strip or reject content containing instruction-override patterns, PII that should not be in context, or content from untrusted sources that should not carry authority. This is especially critical for agents that consume email, documents, or web content — every piece of text retrieved into context is potentially adversarial.
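A deliberately naive sketch of that screen. The patterns below are illustrative, not a complete injection taxonomy; production systems typically layer classifiers on top of pattern matching.

```python
import re

# Illustrative patterns only: injection takes endless forms, so treat this
# as one cheap screen in front of the model, not the whole defense.
OVERRIDE_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"disregard your system prompt", re.I),
]
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # example PII pattern

def screen_retrieved_content(text: str, trusted_source: bool) -> str:
    """Reject untrusted content that tries to carry authority; redact PII
    that should never enter context."""
    if not trusted_source and any(p.search(text) for p in OVERRIDE_PATTERNS):
        raise ValueError("instruction-override pattern in untrusted content")
    return SSN.sub("[REDACTED-SSN]", text)
```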
Output filtering matters because the model will occasionally produce content you cannot ship. Validate structured outputs against schemas before acting on them. Scan for PII leakage before returning to users. Enforce topic and tone constraints in regulated domains.
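A minimal schema check for the refund example, assuming a hypothetical response shape:

```python
def validate_refund_response(output: dict) -> dict:
    """Check the model's structured output against the expected schema before
    anything downstream acts on it. Field names here are hypothetical."""
    required = {"customer_id": str, "amount": (int, float), "reason": str}
    for name, expected_type in required.items():
        if not isinstance(output.get(name), expected_type):
            raise ValueError(f"schema violation: bad or missing field {name!r}")
    if not 0 < output["amount"] <= 500:
        raise ValueError("amount outside allowed range")
    return output
```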
Context boundaries matter because retrieval is the loophole people forget. Your prompt cannot keep the agent from reasoning over HR records if your RAG system retrieves HR records into context. Implement authorization at the retrieval layer — row-level security on the vector store, user-scoped queries against the document index, filter rules that match the session identity. The model cannot leak what it never sees.
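A sketch of a user-scoped retrieval call. The filter syntax is generic; substitute whatever filtered-query mechanism your vector store actually supports.

```python
def retrieve_for_session(store, query_embedding, ctx, k: int = 8) -> list:
    """Authorization happens inside the retrieval call, not after it.
    `store.search` and its filter argument are generic stand-ins for
    whatever filtered query your vector store supports."""
    return store.search(
        vector=query_embedding,
        top_k=k,
        # Only documents this session's identity is cleared for ever enter
        # context; the model cannot leak what it never sees.
        filter={"allowed_groups": {"any_of": ctx.user_groups}},
    )
```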
Budget and loop controls matter because agents can run away. Set maximum turns per session, maximum tool calls per turn, maximum tokens per interaction, and hard timeouts. A stuck reasoning loop or a model hallucinating work for itself becomes a cost incident without these limits, and potentially a safety incident if the loop is calling write tools. Circuit breakers on error rates are worth the small amount of code they take to implement.
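A small budget object makes those limits concrete. The specific numbers are illustrative defaults, not recommendations.

```python
import time

class BudgetExceeded(RuntimeError):
    pass

class SessionBudget:
    """Hard limits on a single agent session."""
    def __init__(self, max_turns=20, max_calls_per_turn=10,
                 max_errors=5, timeout_s=120):
        self.max_turns = max_turns
        self.max_calls_per_turn = max_calls_per_turn
        self.max_errors = max_errors                  # crude circuit breaker
        self.deadline = time.monotonic() + timeout_s  # hard wall-clock timeout
        self.turns = 0
        self.errors = 0
        self.calls_this_turn = 0

    def start_turn(self) -> None:
        if self.turns >= self.max_turns or time.monotonic() > self.deadline:
            raise BudgetExceeded("turn or time budget exhausted")
        self.turns += 1
        self.calls_this_turn = 0

    def record_tool_call(self, succeeded: bool) -> None:
        self.calls_this_turn += 1
        if not succeeded:
            self.errors += 1
        if self.calls_this_turn > self.max_calls_per_turn:
            raise BudgetExceeded("too many tool calls in one turn")
        if self.errors > self.max_errors:
            raise BudgetExceeded("error-rate circuit breaker tripped")
```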
Log everything, and make someone look at the logs
The last layer is observability. Every tool call, every orchestrator decision, every rejected action, every approval token, every model reasoning trace — all of it flows to the same logging and monitoring infrastructure you already use for the rest of production. Agent actions are system actions. They belong in your SIEM, your audit logs, your anomaly detection.
Make the full trace reconstructable: for any given action taken by an agent, you should be able to pull the session context, the user identity, the model's reasoning, the tool calls emitted, the orchestrator's decisions, and the final backing-system execution. If you cannot do this, your incident response will fail the first time you need it.
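One way to make that reconstruction cheap is to emit every step as a structured event keyed by session. A sketch, with the sink standing in for your real log pipeline:

```python
import json
import time
import uuid

def emit_trace_event(sink, session_id: str, user_id: str, stage: str, payload: dict) -> None:
    """One structured event per step, all keyed by session_id, so the full
    chain can be reconstructed later. `sink` is a hypothetical stand-in for
    your real log pipeline (SIEM forwarder, queue producer, file handle)."""
    sink.write(json.dumps({
        "event_id": str(uuid.uuid4()),
        "ts": time.time(),
        "session_id": session_id,
        "user_id": user_id,
        # One stage per layer: "model_output", "policy_decision",
        # "tool_invocation", "backing_system_result", and so on.
        "stage": stage,
        "payload": payload,
    }) + "\n")
```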
Set up alerts for the things that matter: unusual tool call volume, rejected-action spikes, approval token failures, tool calls outside business hours from agents that should not be running then. Rate-limit aggressively at the tool layer — an agent that suddenly calls execute_refund a hundred times in a minute is telling you something, and you want that signal to wake someone up.
If you have an agent in production or headed there soon, run this audit. For every behavior your team describes as a control, find the code that enforces it. If the only place it exists is a prompt, move it. Map your tool registration — is every agent getting the minimum tool set, or the default one? Check your retrieval layer for authorization — can an agent pull context its user is not cleared for? Confirm your service account permissions match what each agent actually needs, and nothing more.
Then pick one tool, ideally the highest-risk one, and trace a call through all five layers end-to-end. Registration, orchestration, tool implementation, backing system IAM, observability. If any layer is thin, that is the layer an incident will find. Thicken it before the incident does.
The model will remain probabilistic. That is not going to change, and it does not need to, because the job of the model was never to be the policy engine. The job of the architecture is to make the model's unreliability irrelevant to the system's guarantees. Build it that way, and the model being wrong is a logged event. Build it the other way, and the model being wrong is the incident.