The LLMFinLiteracy benchmark finds that five open-weight ~7B models generate fully correct Beancount transactions only 2.3% of the time, with failures concentrated in accounting reasoning—not syntax—pointing to compiler-in-the-loop feedback as the critical missing ingredient for reliable write-back agents.
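The compiler-in-the-loop idea can be sketched as a generate-validate-retry cycle. This is a minimal illustration, not the benchmark's method: the hand-rolled balance check stands in for a real Beancount validation pass (in practice one would parse with `beancount.loader.load_string` and inspect the returned errors), and `toy_generate` is a hypothetical model call.

```python
# Sketch of a compiler-in-the-loop repair cycle for LLM-generated ledger
# entries. check_balances is a stand-in for real Beancount validation;
# the generator callable is a stand-in for an LLM.

def check_balances(postings):
    """Return an error string if postings do not sum to zero, else None."""
    total = sum(amount for _, amount in postings)
    if abs(total) > 1e-9:
        return f"transaction does not balance: residual {total:+.2f}"
    return None

def repair_loop(generate, prompt, max_rounds=3):
    """Feed validator errors back to the generator until the entry checks out."""
    feedback = None
    for _ in range(max_rounds):
        postings = generate(prompt, feedback)
        feedback = check_balances(postings)
        if feedback is None:
            return postings  # accepted: passed the compiler-style check
    return None  # give up; caller must not commit an unvalidated entry

# Toy generator: first draft is unbalanced, repairs after seeing feedback.
def toy_generate(prompt, feedback):
    if feedback is None:
        return [("Expenses:Groceries", 42.50), ("Assets:Checking", -40.00)]
    return [("Expenses:Groceries", 42.50), ("Assets:Checking", -42.50)]

result = repair_loop(toy_generate, "book the grocery receipt")
```

The point of the loop is that the feedback signal is mechanical and precise, which is exactly what the benchmark's failure analysis suggests these models lack.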
GuardAgent (ICML 2025) places a separate LLM agent between a target agent and its environment, verifying every proposed action by generating and running Python code — achieving 98.7% policy enforcement accuracy while preserving 100% task completion, versus 81% accuracy and 29–71% task failure for prompt-embedded safety rules.
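GuardAgent's mechanism, gating each proposed action behind generated-and-executed check code, can be sketched roughly as follows. In the paper the guard LLM writes the check from a natural-language policy; here the "generated" check is written out by hand, and the policy, field names, and limit are all illustrative.

```python
# Minimal sketch of a GuardAgent-style action gate: run (model-generated)
# verifier code against a proposed action before it reaches the environment.
# The policy source below stands in for code a guard LLM would emit.

POLICY_CHECK = """
def check(action):
    # Illustrative policy: postings over 500.00 require an approval flag.
    for account, amount in action["postings"]:
        if abs(amount) > 500.00 and not action.get("approved", False):
            return False, f"posting to {account} exceeds limit without approval"
    return True, "ok"
"""

def guard(action, check_source=POLICY_CHECK):
    """Compile and execute the check code, then gate the action on its verdict."""
    namespace = {}
    exec(check_source, namespace)  # run the generated verifier
    allowed, reason = namespace["check"](action)
    return allowed, reason

ok, why = guard({"postings": [("Assets:Checking", -1200.00)], "approved": False})
```

Executing an explicit check program, rather than trusting prompt-embedded rules, is what separates the 98.7% enforcement figure from the 81% baseline.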
A close reading of Du et al.'s ICML 2024 multiagent debate paper, which reports 14.8-point accuracy gains on arithmetic, weighed against 2025 rebuttals showing that equal-budget single agents match debate performance, with an analysis of why Collective Delusion (65% of debate failures) poses specific risks for AI-assisted ledger commits.
CRITIC (ICLR 2024) achieves 7.7 F1 gains on open-domain QA and a 79.2% toxicity reduction by grounding LLM revision in external tool signals — a verify-then-correct loop that maps directly onto write-back safety for Beancount finance agents.
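CRITIC's verify-then-correct loop has a simple shape: draft, critique against an external tool signal, revise, repeat. The sketch below uses arithmetic with Python's own evaluator as the grounding tool; `toy_model` is a hypothetical model call, and the real method interleaves tool use with generation rather than wrapping it in a fixed loop.

```python
# Sketch of a CRITIC-style verify-then-correct loop. tool_verify grounds the
# critique in an external computation instead of the model's self-assessment.

def tool_verify(expression, claimed):
    """External tool signal: actually compute the expression and compare."""
    actual = eval(expression)  # stands in for a calculator/interpreter tool
    if actual == claimed:
        return None
    return f"tool says {expression} = {actual}, not {claimed}"

def critic_loop(model, expression, max_rounds=3):
    critique, answer = None, None
    for _ in range(max_rounds):
        answer = model(expression, critique)
        critique = tool_verify(expression, answer)
        if critique is None:
            break  # verified by the tool: stop revising
    return answer, critique

# Toy model: confident wrong draft, then a revision grounded in the critique.
def toy_model(expression, critique):
    if critique is None:
        return 54  # wrong first draft
    return eval(expression)  # revision informed by the tool's signal

answer, critique = critic_loop(toy_model, "17 * 3 + 4")
```

For a write-back ledger agent, the analogous tool signal is the Beancount loader's error output, making this the same loop structure as compiler-in-the-loop generation.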