
Verifiably Safe Tool Use for LLM Agents: STPA Meets MCP

· 6 min read
Mike Thrift
Marketing Manager

I've been reading the guardrail literature for a while — GuardAgent, ShieldAgent, AGrail — and they all improve detection rates while quietly admitting they can't actually guarantee anything. This ICSE NIER 2026 paper from Doshi et al. at CMU and NC State takes a different angle: instead of asking how to detect bad agent behavior more reliably, they ask how to make unsafe behavior formally impossible. It's a position paper, not an empirical study, but the framing is sharp enough to be worth reading carefully.

The paper


"Towards Verifiably Safe Tool Use for LLM Agents" (arXiv:2601.08012) by Aarya Doshi, Yining Hong, Congying Xu, Eunsuk Kang, Alexandros Kapravelos, and Christian Kästner proposes a methodology for deriving and enforcing safety specifications over LLM agent tool use. The core observation is that risks in agent systems arise primarily from tool composition — not from individual tool failures — so component-level safeguards can't catch them. An agent resolving a calendar conflict might correctly query a private health record and correctly send an email while still doing something catastrophic: leaking the contents of that record to a patient's colleagues.

The proposed solution has two parts. First, the authors apply System-Theoretic Process Analysis (STPA), a safety engineering method from aviation and nuclear systems, to identify agent-level hazards, derive safety requirements, and formalize them as specifications on data flows and tool sequences. Second, they introduce a capability-enhanced Model Context Protocol (MCP) framework where each tool must declare structured metadata: capability tier (read-only, write-only, read-write, execute), confidentiality classification, and trust level. Enforcement is then structured across four tiers: automatic blocklist for provably unsafe flows, mustlist for required sequences, allowlist for pre-approved operations, and confirmation escalation for ambiguous cases.
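
The metadata-plus-tiers design can be sketched concretely. The paper describes the metadata fields in prose but gives no concrete schema, so everything below (the `ToolMeta` shape, the `route` function, the trust/confidentiality values) is an illustrative assumption, not the authors' implementation; the mustlist tier constrains tool *sequences* rather than single flows, so it is omitted here.

```python
from dataclasses import dataclass
from enum import Enum

class Capability(Enum):
    READ_ONLY = "read-only"
    WRITE_ONLY = "write-only"
    READ_WRITE = "read-write"
    EXECUTE = "execute"

class Decision(Enum):
    BLOCK = "block"      # blocklist tier: provably unsafe flow
    ALLOW = "allow"      # allowlist tier: pre-approved operation
    CONFIRM = "confirm"  # ambiguous case: escalate to the user

@dataclass(frozen=True)
class ToolMeta:
    name: str
    capability: Capability
    confidentiality: str  # e.g. "public" or "private" (hypothetical labels)
    trusted: bool

def route(source: ToolMeta, sink: ToolMeta) -> Decision:
    """Classify a data flow from `source` to `sink` into an enforcement tier."""
    # Blocklist: private data must never flow to an untrusted sink.
    if source.confidentiality == "private" and not sink.trusted:
        return Decision.BLOCK
    # Allowlist: public data flowing to a trusted tool is pre-approved.
    if source.confidentiality == "public" and sink.trusted:
        return Decision.ALLOW
    # Everything else is ambiguous and escalates to confirmation.
    return Decision.CONFIRM

calendar = ToolMeta("read_calendar", Capability.READ_ONLY, "private", True)
email = ToolMeta("send_email", Capability.WRITE_ONLY, "public", False)
assert route(calendar, email) is Decision.BLOCK
```

The point of the sketch is that the decision is computed from declared metadata alone, before the agent runs, which is what makes the blocklist and allowlist tiers statically checkable.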

The formal verification step uses Alloy, a model finder based on first-order relational logic, to model the execution space and exhaustively check that, under the stated policies, safety violations cannot occur while safe traces remain reachable. This is the primary "result" of the paper — there are no benchmark accuracy numbers, which is expected for a NIER short paper.
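
The shape of that bounded exhaustive check is easy to mimic outside Alloy. The sketch below is a toy stand-in, not the paper's model: it enumerates every tool-call trace up to a bound, shows that an unsafe trace is reachable without enforcement, and that a policy-compliant trace still completes the task. Tool names are illustrative.

```python
from itertools import product

TOOLS = ["read_record", "update_event", "send_email"]

def violates(trace):
    # Policy: send_email must never follow read_record (private-data leak).
    return any(a == "read_record" and b == "send_email"
               for i, a in enumerate(trace) for b in trace[i + 1:])

def completes_task(trace):
    # Task: the event was updated and attendees were notified.
    return "update_event" in trace and "send_email" in trace

def check(bound=4):
    """Exhaustively enumerate all traces up to `bound` calls."""
    traces = [t for k in range(1, bound + 1)
              for t in product(TOOLS, repeat=k)]
    hazard_reachable = any(violates(t) for t in traces)
    safe_and_useful = any(completes_task(t) and not violates(t)
                          for t in traces)
    return hazard_reachable, safe_and_useful

hazard_reachable, safe_and_useful = check()
assert hazard_reachable and safe_and_useful
```

Alloy does the same kind of bounded enumeration symbolically and far more efficiently; the paper's claim is precisely this pair of facts, checked over its calendar model: no policy-compliant trace violates safety, and safe traces that complete the task remain reachable.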

Key ideas

  • STPA reframes agent safety as a systems engineering problem: identify losses, trace back through hazardous interactions, derive requirements — before writing any enforcement code
  • Specifications split into two kinds: information flow constraints ("event emails must not include private data not belonging to the recipient") and temporal logic constraints ("every update_event must be followed by send_email to each attendee")
  • The four-tier enforcement (blocklist / mustlist / allowlist / confirmation) is designed to reduce security fatigue — most safe flows are preapproved, so the agent isn't constantly asking for permission
  • Alloy's exhaustive trace analysis confirmed the absence of unsafe flows in the calendar case study while preserving task functionality
  • The whole approach is explicitly scoped to task-specific agents, not general-purpose assistants — the authors acknowledge that narrow agents are more feasible to secure
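
The two specification kinds from the list above can be written as predicates over an execution trace. The trace encoding (a list of dicts with `tool`, `to`, `attendees`, and `data_owners` keys) is an assumption of this sketch; the paper states the constraints in prose, not in this form.

```python
def no_foreign_private_data(trace):
    """Information-flow constraint: an email may carry only data
    owned by its recipient."""
    return all(all(owner == call["to"] for owner in call.get("data_owners", []))
               for call in trace if call["tool"] == "send_email")

def updates_notified(trace):
    """Temporal constraint: every update_event is eventually followed
    by a send_email to each attendee of that event."""
    for i, call in enumerate(trace):
        if call["tool"] == "update_event":
            emailed = {c["to"] for c in trace[i + 1:]
                       if c["tool"] == "send_email"}
            if not set(call["attendees"]) <= emailed:
                return False
    return True

trace = [
    {"tool": "update_event", "attendees": ["ana", "bo"]},
    {"tool": "send_email", "to": "ana", "data_owners": ["ana"]},
    {"tool": "send_email", "to": "bo", "data_owners": ["bo"]},
]
assert no_foreign_private_data(trace) and updates_notified(trace)
```

The asymmetry is worth noticing: the information-flow check is per-call, while the temporal check needs the whole suffix of the trace — which is why the latter belongs in a runtime monitor or a model checker rather than a per-tool guard.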

What holds up — and what doesn't

The intellectual move is sound: borrowing STPA from safety-critical engineering is the right instinct. Unlike probabilistic guardrails, this approach converts requirements into predicates over system traces, which can be verified rather than estimated. The four-tier enforcement hierarchy is thoughtfully designed — the distinction between blocklist and confirmation especially matters, because permanent confirmation prompts erode user trust and get ignored.

That said, the paper's limitations are significant and mostly go unaddressed. The trust-in-metadata problem is acknowledged but not solved: the whole framework depends on tool developers labeling their tools accurately. In an open MCP marketplace where third-party tools are common, there's no enforcement mechanism for label correctness. The formal verification is also done on a hand-crafted Alloy toy model — it demonstrates feasibility of the approach, not that the approach can be applied to real systems at scale.

I also don't see a convincing argument for why STPA is the right hazard-analysis method versus, say, threat modeling or HAZOP. The calendar case study is illustrative but trivially small. And there's no discussion of adversarial tool providers who deliberately mislabel capabilities — which is a real attack surface the related MCP security literature (arXiv:2601.17549) examines in detail.

The honest framing is that this is a design proposal with a proof-of-concept formal model. The hard engineering work — building the policy engine, testing label coverage across diverse tools, measuring the autonomy-safety tradeoff empirically — is declared as future work.

Why this matters for finance AI

Beancount write-back agents face exactly the hazard pattern this paper is designed for: tool composition creating emergent risks. An agent that reads a sensitive account entry and then posts a summary to a shared ledger might be doing something perfectly reasonable in each individual step while violating a confidentiality constraint that's only visible at the system level. STPA's approach of starting from stakeholder losses and inverting them into requirements maps cleanly onto the finance domain, where the losses are regulatory violations, unauthorized disclosures, and irreversible ledger mutations.

The MCP extension is directly relevant because Beancount tooling is increasingly being wrapped as MCP servers. If those tools can declare their capability tier and confidentiality class in a structured, machine-readable way, it becomes possible to enforce data-flow policies at the protocol boundary rather than hoping the agent self-polices. The temporal logic constraints — requiring that every post_transaction be preceded by a successful balance_check — are exactly the kind of invariant a finance agent needs to guarantee before committing writes.
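
As a minimal sketch of that precedence invariant enforced at the protocol boundary: the tool names (`balance_check`, `post_transaction`) follow the hypothetical finance agent described above and are not real Beancount APIs.

```python
def guard(trace):
    """Reject any trace where a post_transaction is not preceded by a
    successful balance_check since the previous write."""
    checked = False
    for tool, result in trace:
        if tool == "balance_check" and result == "ok":
            checked = True
        elif tool == "post_transaction":
            if not checked:
                return False   # blocklist tier: provably unsafe
            checked = False    # a fresh check is required before the next write
    return True

assert guard([("balance_check", "ok"), ("post_transaction", None)])
assert not guard([("post_transaction", None)])
```

Resetting `checked` after each write is the interesting design choice: a single stale balance check must not authorize an unbounded run of ledger mutations.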

The missing piece, for now, is that none of this has been built and tested beyond the toy model. But as a design vocabulary for thinking about ledger agent safety, STPA plus information-flow control is the most principled framework I've seen in this literature.

Related reading

  • "Securing AI Agents with Information-Flow Control" — arXiv:2505.23643, Microsoft Research; implements a concrete IFC system (Fides) with taint tracking, evaluated on AgentDojo; the empirical complement to this paper
  • "Breaking the Protocol: Security Analysis of the Model Context Protocol Specification and Prompt Injection Vulnerabilities in Tool-Integrated LLM Agents" — arXiv:2601.17549; directly analyzes the MCP attack surface this paper's framework is meant to defend
  • "Systematic Hazard Analysis for Frontier AI using STPA" — arXiv:2506.01782; a more recent application of STPA methodology to AI systems broadly, useful for understanding how the technique scales