The LLMFinLiteracy benchmark finds that five open-weight ~7B models generate fully correct Beancount transactions only 2.3% of the time, with failures concentrated in accounting reasoning—not syntax—pointing to compiler-in-the-loop feedback as the critical missing ingredient for reliable write-back agents.
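The compiler-in-the-loop idea can be sketched as a generate-validate-retry cycle. This is a minimal illustration, not the benchmark's method: the hand-rolled balance check stands in for a real Beancount validation pass (in practice one would parse with `beancount.loader.load_string` and inspect the returned errors), and `toy_generate` is a hypothetical model call.

```python
# Sketch of a compiler-in-the-loop repair cycle for LLM-generated ledger
# entries. check_balances is a stand-in for real Beancount validation;
# the generator callable is a stand-in for an LLM.

def check_balances(postings):
    """Return an error string if postings do not sum to zero, else None."""
    total = sum(amount for _, amount in postings)
    if abs(total) > 1e-9:
        return f"transaction does not balance: residual {total:+.2f}"
    return None

def repair_loop(generate, prompt, max_rounds=3):
    """Feed validator errors back to the generator until the entry checks out."""
    feedback = None
    for _ in range(max_rounds):
        postings = generate(prompt, feedback)
        feedback = check_balances(postings)
        if feedback is None:
            return postings  # accepted: passed the compiler-style check
    return None  # give up; caller must not commit an unvalidated entry

# Toy generator: first draft is unbalanced, repairs after seeing feedback.
def toy_generate(prompt, feedback):
    if feedback is None:
        return [("Expenses:Groceries", 42.50), ("Assets:Checking", -40.00)]
    return [("Expenses:Groceries", 42.50), ("Assets:Checking", -42.50)]

result = repair_loop(toy_generate, "book the grocery receipt")
```

The point of the loop is that the feedback signal is mechanical and precise, which is exactly what the benchmark's failure analysis suggests these models lack.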
GuardAgent (ICML 2025) places a separate LLM agent between a target agent and its environment, verifying every proposed action by generating and running Python code — achieving 98.7% policy enforcement accuracy while preserving 100% task completion, versus 81% accuracy and 29–71% task failure for prompt-embedded safety rules.
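GuardAgent's mechanism, gating each proposed action behind generated-and-executed check code, can be sketched roughly as follows. In the paper the guard LLM writes the check from a natural-language policy; here the "generated" check is written out by hand, and the policy, field names, and limit are all illustrative.

```python
# Minimal sketch of a GuardAgent-style action gate: run (model-generated)
# verifier code against a proposed action before it reaches the environment.
# The policy source below stands in for code a guard LLM would emit.

POLICY_CHECK = """
def check(action):
    # Illustrative policy: postings over 500.00 require an approval flag.
    for account, amount in action["postings"]:
        if abs(amount) > 500.00 and not action.get("approved", False):
            return False, f"posting to {account} exceeds limit without approval"
    return True, "ok"
"""

def guard(action, check_source=POLICY_CHECK):
    """Compile and execute the check code, then gate the action on its verdict."""
    namespace = {}
    exec(check_source, namespace)  # run the generated verifier
    allowed, reason = namespace["check"](action)
    return allowed, reason

ok, why = guard({"postings": [("Assets:Checking", -1200.00)], "approved": False})
```

Executing an explicit check program, rather than trusting prompt-embedded rules, is what separates the 98.7% enforcement figure from the 81% baseline.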
A close reading of Du et al.'s ICML 2024 multiagent debate paper, which reports 14.8-point accuracy gains on arithmetic, weighed against 2025 rebuttals showing that equal-budget single agents match debate performance, with an analysis of why Collective Delusion (65% of debate failures) poses specific risks for AI-assisted ledger commits.
CRITIC (ICLR 2024) achieves 7.7 F1 gains on open-domain QA and a 79.2% toxicity reduction by grounding LLM revision in external tool signals — a verify-then-correct loop that maps directly onto write-back safety for Beancount finance agents.
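CRITIC's verify-then-correct loop has a simple shape: draft, critique against an external tool signal, revise, repeat. The sketch below uses arithmetic with Python's own evaluator as the grounding tool; `toy_model` is a hypothetical model call, and the real method interleaves tool use with generation rather than wrapping it in a fixed loop.

```python
# Sketch of a CRITIC-style verify-then-correct loop. tool_verify grounds the
# critique in an external computation instead of the model's self-assessment.

def tool_verify(expression, claimed):
    """External tool signal: actually compute the expression and compare."""
    actual = eval(expression)  # stands in for a calculator/interpreter tool
    if actual == claimed:
        return None
    return f"tool says {expression} = {actual}, not {claimed}"

def critic_loop(model, expression, max_rounds=3):
    critique, answer = None, None
    for _ in range(max_rounds):
        answer = model(expression, critique)
        critique = tool_verify(expression, answer)
        if critique is None:
            break  # verified by the tool: stop revising
    return answer, critique

# Toy model: confident wrong draft, then a revision grounded in the critique.
def toy_model(expression, critique):
    if critique is None:
        return 54  # wrong first draft
    return eval(expression)  # revision informed by the tool's signal

answer, critique = critic_loop(toy_model, "17 * 3 + 4")
```

For a write-back ledger agent, the analogous tool signal is the Beancount loader's error output, making this the same loop structure as compiler-in-the-loop generation.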