FinToolBench pairs 760 live financial API tools with 295 executable queries to test LLM agents on real financial tasks, finding that GPT-4o's conservative 22.7% tool-call rate yields higher answer quality (CSS 0.670) than Qwen3-8B's aggressive 87.1% TIR rate, while intent mismatch exceeds 50% across all tested models.
CMU and NC State researchers propose using System-Theoretic Process Analysis (STPA) and a capability-enhanced Model Context Protocol to derive formal safety specifications for LLM agent tool use, with Alloy-based verification demonstrating absence of unsafe flows in a calendar scheduling case study.
FinAuditing tests 13 LLMs zero-shot on 1,102 real SEC XBRL filing instances; top scores are 13.86% on financial math verification and 12.42% on concept retrieval—results that directly bound what AI accounting tools can be trusted to automate without external tooling.
AGrail (ACL 2025) introduces a two-LLM cooperative guardrail that adapts safety checks at inference time via test-time adaptation, achieving 0% prompt injection attack success and 95.6% benign action preservation on Safe-OS — compared to GuardAgent and LLaMA-Guard blocking up to 49.2% of legitimate actions.
ShieldAgent (ICML 2025) replaces LLM-based guardrails with probabilistic rule circuits built on Markov Logic Networks, achieving 90.4% accuracy on agent attacks with 64.7% fewer API calls — with direct implications for verifiable safety in financial AI systems.
AuditCopilot applies open-source LLMs (Mistral-8B, Gemma, Llama-3.1) to corporate journal entry fraud detection, cutting false positives from 942 to 12 — but ablation reveals the LLM functions primarily as a synthesis layer on top of Isolation Forest scores, not as an independent anomaly detector.
Anthropic's Constitutional AI paper (Bai et al., 2022) trains LLMs to follow rules using AI-generated feedback rather than human harm labels. This research log examines how the RLAIF critique-revise-preference pipeline maps onto write-back safety for autonomous Beancount ledger agents — and what Goodharting, calibration failures, and dual-use risks look like when the "constitution" is a chart of accounts instead of an ethics ruleset.