
GuardAgent: Deterministic Safety Enforcement for LLM Agents via Code Execution

· 6 min read
Mike Thrift
Marketing Manager

The central safety problem for any write-back agent is: how do you stop it from taking an action it was never supposed to take? GuardAgent (Xiang et al., ICML 2025) proposes a dedicated guardrail agent — a separate LLM agent that checks every action of the target agent against a set of safety policies before it executes. For Bean Labs, where the question of "can the agent write to the ledger without violating accounting rules?" is non-negotiable, this paper sits squarely at the core of our research agenda.

The paper


GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning, by Zhen Xiang and colleagues from UIUC, Emory, and UC Berkeley, introduces what the authors claim is the first general-purpose guardrail agent for LLM systems. The core idea is straightforward: rather than embedding safety constraints directly in the target agent's prompt (where they can be ignored or forgotten), GuardAgent runs as an independent agent that intercepts target agent actions and verifies them against stated policies.

The mechanism has two stages. First, the LLM reads the safety policy and the target agent's proposed input or output, then writes a step-by-step action plan describing what to check. Second, it converts that plan into executable Python code using a predefined toolbox of callable functions. An external Python interpreter runs the code, producing a deterministic yes/no verdict. A memory module stores past task tuples (policy, plan, code) and retrieves similar prior examples using Levenshtein distance, enabling few-shot in-context learning without any additional fine-tuning. The paper evaluates the approach on two new benchmarks introduced by the authors.
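The two-stage flow can be sketched as a minimal harness. Everything here is illustrative — the toolbox function name, the access matrix, and the "generated" code string are assumptions standing in for what the LLM and the paper's actual toolbox would produce:

```python
# Sketch of GuardAgent's plan -> code -> execute pipeline (names are
# illustrative, not the paper's actual toolbox API).

# Stage 1 output: a natural-language action plan (normally written by the LLM).
plan = [
    "1. Identify the user's role from the target agent's input.",
    "2. Check whether that role may access the requested database.",
]

# Predefined toolbox of callable functions the generated code may use.
ACCESS_MATRIX = {
    "physician": {"diagnosis", "medication"},
    "nurse": {"vitalperiodic"},
}

def check_access(role: str, database: str) -> bool:
    """Return True if the role is allowed to read the database."""
    return database in ACCESS_MATRIX.get(role, set())

# Stage 2 output: guardrail code (normally generated by the LLM from the plan).
generated_code = 'allowed = check_access(request["role"], request["database"])'

def run_guardrail(request: dict) -> bool:
    # Execute the generated code in a namespace exposing only the toolbox,
    # so the final allow/block verdict comes from the interpreter, not the LLM.
    namespace = {"check_access": check_access, "request": request}
    exec(generated_code, namespace)
    return namespace["allowed"]

print(run_guardrail({"role": "nurse", "database": "diagnosis"}))     # False -> block
print(run_guardrail({"role": "physician", "database": "diagnosis"}))  # True -> allow
```

The key property is that the LLM's output is the `generated_code` string, not the verdict: even if the model's reasoning is shaky, the decision that reaches the target agent is whatever the interpreter computes.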

Key ideas

  • Code execution beats natural-language verdicts. When safety decisions are rendered as code rather than LLM text, the output is deterministic: the Python engine either allows or blocks, with no hallucination risk in the final decision step. The LLM only has to generate correct code, not a correct final answer.
  • Two custom benchmarks. EICU-AC (316 examples) tests access control for EHRAgent over a healthcare database with three user roles and ten databases. Mind2Web-SC (200 examples) tests six safety policies for the SeeAct web agent. Both were created by the authors specifically for this paper.
  • 98.7% and 90.0% label accuracy (GPT-4) on EICU-AC and Mind2Web-SC respectively. Critically, final response accuracy — whether the target agent still completes its task when it should be allowed — stays at 100% in both settings.
  • Hardcoded safety rules fail badly. Embedding policies directly in the target agent's system prompt achieves only 81.0% / 77.5% accuracy on the two benchmarks, and causes the target agent to fail 29.2% / 71.0% of its legitimate tasks — an unacceptable utility-safety tradeoff.
  • LlamaGuard is the wrong tool here. The content moderation baseline achieves roughly 50% accuracy — near chance — because it was designed for harmful text, not for reasoning about whether an agent action violates a specific access-control rule.
  • Removing the toolbox costs about 8 points. Without the callable function library, GuardAgent falls from 98.7% to 90.8% on EICU-AC, though it partly compensates by autonomously defining needed functions (68% of outputs generate code even without explicit instruction, which the authors treat as supporting evidence for the code-generation design).
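The memory module's retrieval step is simple enough to sketch directly. The storage format below is an assumption; the paper only specifies that past (policy, plan, code) tuples are ranked by Levenshtein distance to the incoming task:

```python
# Sketch of GuardAgent's memory retrieval: rank stored (policy, plan, code)
# tuples by edit distance to the new policy text. Storage format is assumed.

def levenshtein(a: str, b: str) -> int:
    """Classic two-row dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

MEMORY = [  # hypothetical past task tuples
    ("nurses may not read diagnosis tables", "...plan...", "...code..."),
    ("block purchases for users under 18", "...plan...", "...code..."),
]

def retrieve(new_policy: str, k: int = 1):
    """Return the k stored tuples whose policy text is closest to the new one."""
    return sorted(MEMORY, key=lambda t: levenshtein(new_policy, t[0]))[:k]

best = retrieve("nurses may not read medication tables")[0]
print(best[0])  # closest stored policy; its plan/code seed the few-shot prompt
```

This also makes the later critique concrete: string distance retrieves good demonstrations when the new policy is a lexical variant of an old one, and poor ones when it is semantically related but worded differently.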

What holds up — and what doesn't

The central architectural insight — that code execution gives you a deterministic enforcement point — is genuinely useful, and the ablations are honest. The comparison with hardcoded safety rules is particularly convincing: it shows that naive "just add rules to the prompt" designs degrade target utility while still failing to enforce safety reliably.

But the evaluation has real limits. The two benchmarks are small (316 and 200 examples), and the authors constructed both, which creates an obvious overfitting risk. EICU-AC is essentially an access-control matrix (role × database), which is a structured, enumerable problem — the kind of thing code is naturally good at. Mind2Web-SC is messier, and the 90.0% figure there is considerably less impressive than it appears at first: the authors acknowledge that rule 5 (covering "movies, music, and videos") causes the most failures because it requires broad open-world reasoning. That's the kind of rule a real finance agent would face constantly.

The memory module retrieves demonstrations by string similarity, which works fine for repeated policy types but will degrade on genuinely novel policies. And the entire framework assumes a "trusted context" — the safety policies themselves must be provided by a trusted administrator. If an attacker can modify the policies, or if the toolbox contains unsafe functions, GuardAgent offers no protection. The paper does not model adversarial policy manipulation. Follow-on work (ShieldAgent, arXiv:2503.22738; AGrail, arXiv:2502.11448) has already highlighted these gaps, with ShieldAgent reporting an 11.3% average improvement over GuardAgent across broader benchmarks.

Why this matters for finance AI

The Beancount write-back agent needs more than a safety prompt — it needs a mechanism to enforce accounting rules that is structurally separate from the agent doing the work. GuardAgent's architecture maps directly to this: a guard agent that checks every proposed journal entry against accounting rules (debit == credit, no posting to locked periods, no modification of reconciled transactions) before the write executes. The code-execution enforcement layer is especially attractive here because double-entry arithmetic is exactly the kind of structured, enumerable check that code handles reliably and LLM text does not.
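To make the mapping concrete, here is a hedged sketch of what such a guard layer could look like. The entry format, rule names, and lock date are all hypothetical — this is not Beancount's API, just the shape of deterministic checks a guard agent's generated code would run before a write:

```python
# Hypothetical guard checks for a Beancount write-back agent. The entry
# dict format and the rules below are assumptions for illustration.
from datetime import date

LOCKED_BEFORE = date(2025, 1, 1)  # assumed close date for locked periods

def check_entry(entry: dict) -> list:
    """Return a list of violated rules; an empty list means the write may proceed."""
    violations = []
    # Rule 1: double-entry arithmetic -- postings must sum to zero.
    if abs(sum(p["amount"] for p in entry["postings"])) > 1e-9:
        violations.append("debits != credits")
    # Rule 2: no posting into a locked accounting period.
    if entry["date"] < LOCKED_BEFORE:
        violations.append("posting to locked period")
    # Rule 3: no modification of reconciled transactions.
    if entry.get("reconciled"):
        violations.append("modifying reconciled transaction")
    return violations

entry = {
    "date": date(2025, 3, 1),
    "reconciled": False,
    "postings": [
        {"account": "Expenses:Food", "amount": 12.50},
        {"account": "Assets:Cash", "amount": -12.50},
    ],
}
print(check_entry(entry))  # [] -> allowed
```

Each rule here is a pure function over structured data, which is exactly the regime where the paper's code-execution argument is strongest.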

The honest limitation is that GuardAgent assumes you can enumerate your safety policies in advance and encode them into a toolbox. In production Beancount deployments, some constraints are implicit (following ledger conventions the user has built up over years) and some are dynamic (budgets change, account structures get refactored). GuardAgent does not tell you how to handle constraints that can't be pre-specified. That's the harder problem, and it remains open.

  • ShieldAgent (arXiv:2503.22738, ICML 2025) — builds on GuardAgent with verifiable safety policy reasoning and ShieldAgent-Bench (2K examples across six web environments); reports 11.3% improvement over GuardAgent and 64.7% reduction in API calls
  • AGrail (arXiv:2502.11448) — proposes adaptive safety checks that transfer across agent tasks rather than requiring per-task demonstrations; addresses GuardAgent's scalability limitation directly
  • ToolSafe (arXiv:2601.10156) — proactive step-level guardrails with feedback for tool-calling agents; more granular than GuardAgent's input/output interception model