
Verifiably Safe Tool Use for LLM Agents: STPA Meets MCP

· 6 min read
Mike Thrift
Marketing Manager

I've been reading the guardrail literature for a while — GuardAgent, ShieldAgent, AGrail — and they all improve detection rates while quietly admitting they can't actually guarantee anything. This ICSE NIER 2026 paper from Doshi et al. at CMU and NC State takes a different angle: instead of asking how to detect bad agent behavior more reliably, they ask how to make unsafe behavior formally impossible. It's a position paper, not an empirical study, but the framing is sharp enough to be worth reading carefully.

The paper


"Towards Verifiably Safe Tool Use for LLM Agents" (arXiv:2601.08012) by Aarya Doshi, Yining Hong, Congying Xu, Eunsuk Kang, Alexandros Kapravelos, and Christian Kästner proposes a methodology for deriving and enforcing safety specifications over LLM agent tool use. The core observation is that risks in agent systems arise primarily from tool composition — not from individual tool failures — so component-level safeguards can't catch them. An agent resolving a calendar conflict might correctly query a private health record and correctly send an email while still doing something catastrophic: leaking the contents of that record to a patient's colleagues.

The proposed solution has two parts. First, the authors apply System-Theoretic Process Analysis (STPA), a safety engineering method from aviation and nuclear systems, to identify agent-level hazards, derive safety requirements, and formalize them as specifications on data flows and tool sequences. Second, they introduce a capability-enhanced Model Context Protocol (MCP) framework where each tool must declare structured metadata: capability tier (read-only, write-only, read-write, execute), confidentiality classification, and trust level. Enforcement is then structured across four tiers: automatic blocklist for provably unsafe flows, mustlist for required sequences, allowlist for pre-approved operations, and confirmation escalation for ambiguous cases.
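
The metadata-plus-tiers design can be sketched concretely. The paper describes the metadata fields in prose but gives no concrete schema, so everything below (the `ToolMeta` shape, the `route` function, the trust/confidentiality values) is an illustrative assumption, not the authors' implementation; the mustlist tier constrains tool *sequences* rather than single flows, so it is omitted here.

```python
from dataclasses import dataclass
from enum import Enum

class Capability(Enum):
    READ_ONLY = "read-only"
    WRITE_ONLY = "write-only"
    READ_WRITE = "read-write"
    EXECUTE = "execute"

class Decision(Enum):
    BLOCK = "block"      # blocklist tier: provably unsafe flow
    ALLOW = "allow"      # allowlist tier: pre-approved operation
    CONFIRM = "confirm"  # ambiguous case: escalate to the user

@dataclass(frozen=True)
class ToolMeta:
    name: str
    capability: Capability
    confidentiality: str  # e.g. "public" or "private" (hypothetical labels)
    trusted: bool

def route(source: ToolMeta, sink: ToolMeta) -> Decision:
    """Classify a data flow from `source` to `sink` into an enforcement tier."""
    # Blocklist: private data must never flow to an untrusted sink.
    if source.confidentiality == "private" and not sink.trusted:
        return Decision.BLOCK
    # Allowlist: public data flowing to a trusted tool is pre-approved.
    if source.confidentiality == "public" and sink.trusted:
        return Decision.ALLOW
    # Everything else is ambiguous and escalates to confirmation.
    return Decision.CONFIRM

calendar = ToolMeta("read_calendar", Capability.READ_ONLY, "private", True)
email = ToolMeta("send_email", Capability.WRITE_ONLY, "public", False)
assert route(calendar, email) is Decision.BLOCK
```

The point of the sketch is that the decision is computed from declared metadata alone, before the agent runs, which is what makes the blocklist and allowlist tiers statically checkable.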

The formal verification step uses Alloy, a model finder based on first-order relational logic, to model the execution space and exhaustively check that, under the stated policies, safety violations cannot occur while safe traces remain reachable. This is the primary "result" of the paper — there are no benchmark accuracy numbers, which is expected for a NIER short paper.
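
The shape of that bounded exhaustive check is easy to mimic outside Alloy. The sketch below is a toy stand-in, not the paper's model: it enumerates every tool-call trace up to a bound, shows that an unsafe trace is reachable without enforcement, and that a policy-compliant trace still completes the task. Tool names are illustrative.

```python
from itertools import product

TOOLS = ["read_record", "update_event", "send_email"]

def violates(trace):
    # Policy: send_email must never follow read_record (private-data leak).
    return any(a == "read_record" and b == "send_email"
               for i, a in enumerate(trace) for b in trace[i + 1:])

def completes_task(trace):
    # Task: the event was updated and attendees were notified.
    return "update_event" in trace and "send_email" in trace

def check(bound=4):
    """Exhaustively enumerate all traces up to `bound` calls."""
    traces = [t for k in range(1, bound + 1)
              for t in product(TOOLS, repeat=k)]
    hazard_reachable = any(violates(t) for t in traces)
    safe_and_useful = any(completes_task(t) and not violates(t)
                          for t in traces)
    return hazard_reachable, safe_and_useful

hazard_reachable, safe_and_useful = check()
assert hazard_reachable and safe_and_useful
```

Alloy does the same kind of bounded enumeration symbolically and far more efficiently; the paper's claim is precisely this pair of facts, checked over its calendar model: no policy-compliant trace violates safety, and safe traces that complete the task remain reachable.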

Key ideas

  • STPA reframes agent safety as a systems engineering problem: identify losses, trace back through hazardous interactions, derive requirements — before writing any enforcement code
  • Specifications split into two kinds: information flow constraints ("event emails must not include private data not belonging to the recipient") and temporal logic constraints ("every update_event must be followed by send_email to each attendee")
  • The four-tier enforcement (blocklist / mustlist / allowlist / confirmation) is designed to reduce security fatigue — most safe flows are preapproved, so the agent isn't constantly asking for permission
  • Alloy's exhaustive trace analysis confirmed the absence of unsafe flows in the calendar case study while preserving task functionality
  • The whole approach is explicitly scoped to task-specific agents, not general-purpose assistants — the authors acknowledge that narrow agents are more feasible to secure
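
The two specification kinds from the list above can be written as predicates over an execution trace. The trace encoding (a list of dicts with `tool`, `to`, `attendees`, and `data_owners` keys) is an assumption of this sketch; the paper states the constraints in prose, not in this form.

```python
def no_foreign_private_data(trace):
    """Information-flow constraint: an email may carry only data
    owned by its recipient."""
    return all(all(owner == call["to"] for owner in call.get("data_owners", []))
               for call in trace if call["tool"] == "send_email")

def updates_notified(trace):
    """Temporal constraint: every update_event is eventually followed
    by a send_email to each attendee of that event."""
    for i, call in enumerate(trace):
        if call["tool"] == "update_event":
            emailed = {c["to"] for c in trace[i + 1:]
                       if c["tool"] == "send_email"}
            if not set(call["attendees"]) <= emailed:
                return False
    return True

trace = [
    {"tool": "update_event", "attendees": ["ana", "bo"]},
    {"tool": "send_email", "to": "ana", "data_owners": ["ana"]},
    {"tool": "send_email", "to": "bo", "data_owners": ["bo"]},
]
assert no_foreign_private_data(trace) and updates_notified(trace)
```

The asymmetry is worth noticing: the information-flow check is per-call, while the temporal check needs the whole suffix of the trace — which is why the latter belongs in a runtime monitor or a model checker rather than a per-tool guard.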

What holds up — and what doesn't

The intellectual move is sound: borrowing STPA from safety-critical engineering is the right instinct. Unlike probabilistic guardrails, this approach converts requirements into predicates over system traces, which can be verified rather than estimated. The four-tier enforcement hierarchy is thoughtfully designed — the distinction between blocklist and confirmation especially matters, because permanent confirmation prompts erode user trust and get ignored.

That said, the paper's limitations are significant and mostly go unaddressed. The trust-in-metadata problem is acknowledged but not solved: the whole framework depends on tool developers labeling their tools accurately. In an open MCP marketplace where third-party tools are common, there's no enforcement mechanism for label correctness. The formal verification is also done on a hand-crafted Alloy toy model — it demonstrates feasibility of the approach, not that the approach can be applied to real systems at scale.

I also don't see a convincing argument for why STPA is the right hazard-analysis method versus, say, threat modeling or HAZOP. The calendar case study is illustrative but trivially small. And there's no discussion of adversarial tool providers who deliberately mislabel capabilities — which is a real attack surface the related MCP security literature (arXiv:2601.17549) examines in detail.

The honest framing is that this is a design proposal with a proof-of-concept formal model. The hard engineering work — building the policy engine, testing label coverage across diverse tools, measuring the autonomy-safety tradeoff empirically — is declared as future work.

Why this matters for finance AI

Beancount write-back agents face exactly the hazard pattern this paper is designed for: tool composition creating emergent risks. An agent that reads a sensitive account entry and then posts a summary to a shared ledger might be doing something perfectly reasonable in each individual step while violating a confidentiality constraint that's only visible at the system level. STPA's approach of starting from stakeholder losses and inverting them into requirements maps cleanly onto the finance domain, where the losses are regulatory violations, unauthorized disclosures, and irreversible ledger mutations.

The MCP extension is directly relevant because Beancount tooling is increasingly being wrapped as MCP servers. If those tools can declare their capability tier and confidentiality class in a structured, machine-readable way, it becomes possible to enforce data-flow policies at the protocol boundary rather than hoping the agent self-polices. The temporal logic constraints — requiring that every post_transaction be preceded by a successful balance_check — are exactly the kind of invariant a finance agent needs to guarantee before committing writes.
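
As a minimal sketch of that precedence invariant enforced at the protocol boundary: the tool names (`balance_check`, `post_transaction`) follow the hypothetical finance agent described above and are not real Beancount APIs.

```python
def guard(trace):
    """Reject any trace where a post_transaction is not preceded by a
    successful balance_check since the previous write."""
    checked = False
    for tool, result in trace:
        if tool == "balance_check" and result == "ok":
            checked = True
        elif tool == "post_transaction":
            if not checked:
                return False   # blocklist tier: provably unsafe
            checked = False    # a fresh check is required before the next write
    return True

assert guard([("balance_check", "ok"), ("post_transaction", None)])
assert not guard([("post_transaction", None)])
```

Resetting `checked` after each write is the interesting design choice: a single stale balance check must not authorize an unbounded run of ledger mutations.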

The missing piece, for now, is that none of this has been built and tested beyond the toy model. But as a design vocabulary for thinking about ledger agent safety, STPA plus information-flow control is the most principled framework I've seen in this literature.

Related reading

  • "Securing AI Agents with Information-Flow Control" — arXiv:2505.23643, Microsoft Research; implements a concrete IFC system (Fides) with taint tracking, evaluated on AgentDojo; the empirical complement to this paper
  • "Breaking the Protocol: Security Analysis of the Model Context Protocol Specification and Prompt Injection Vulnerabilities in Tool-Integrated LLM Agents" — arXiv:2601.17549; directly analyzes the MCP attack surface this paper's framework is meant to defend
  • "Systematic Hazard Analysis for Frontier AI using STPA" — arXiv:2506.01782; a more recent application of STPA methodology to AI systems broadly, useful for understanding how the technique scales