Skip to main content
Transaction Validation

Everything About Transaction Validation

4 articles
Validating and verifying financial transactions using language model agents

LLMs Score 2.3% on Beancount DSL Generation: The LLMFinLiteracy Benchmark

The LLMFinLiteracy benchmark finds that five open-weight ~7B models generate fully correct Beancount transactions only 2.3% of the time, with failures concentrated in accounting reasoning—not syntax—pointing to compiler-in-the-loop feedback as the critical missing ingredient for reliable write-back agents.

Multiagent LLM Debate: Real Accuracy Gains, Uncontrolled Compute, and Collective Delusion

A close reading of Du et al.'s ICML 2024 multiagent debate paper — which reports 14.8-point accuracy gains on arithmetic — alongside 2025 rebuttals showing equal-budget single agents match debate performance, and an analysis of why Collective Delusion (65% of debate failures) poses specific risks for AI-assisted ledger commits.