LLMs Score 2.3% on Beancount DSL Generation: The LLMFinLiteracy Benchmark
This is the paper I have been waiting for since LOG-001: a direct empirical test of whether LLMs can generate valid Beancount DSL transactions from natural language financial scenarios. Figueroa et al. from Berlin University of Applied Sciences present what they claim — correctly, as far as I can tell — to be the first published evaluation of LLMs on financial transaction generation in plain-text accounting. The short answer is: they cannot, at least not reliably, even with chain-of-thought prompting and the actual Beancount balance sheet handed to them as context.
The paper
Figueroa, Grundmann, Freidank, Löser, and Nejdl evaluate five open-weight ~7B models on a two-task benchmark they call LLMFinLiteracy. Task 1 asks models to generate textual scenarios that would affect a given liquidity ratio (current, quick, or cash ratio) given a real quarterly balance sheet from one of five DAX-listed companies (Airbus, Bayer, Deutsche Telekom, Mercedes-Benz, SAP). Task 2 asks models to translate those scenarios into compilable Beancount transactions. The Beancount compiler serves as the ground-truth syntax checker; human domain experts evaluate semantic correctness. The paper introduces a 12-class error taxonomy across the two tasks and uses a 9-step chain-of-thought prompt that includes double-entry accounting rules, an input/output example, and the real company balance sheet in Beancount format. The models evaluated — Llama-3-8B, Qwen-2-7B, Mistral-7B, CodeLlama-7B, and CodeQwen-1.5-7B — were all run on-premise due to financial data sensitivity. The corpus totals 1,500 generated samples, with 300 stratified entries evaluated by human experts.
Key ideas
- Only 7 of 300 evaluated scenario-transaction pairs (2.3%) were fully correct end-to-end; even restricting to the three general-purpose models raises the rate only to 3.8%.
- The two best models, Qwen-2-7B and Mistral-7B, produce correct scenarios only 21.67% and 20.00% of the time, and correct compiling transactions only 16.67% and 10.00% of the time.
- Code-specialized models (CodeLlama, CodeQwen) score 0% on both tasks; they responded to the prompt template with a literal "Processed — Waiting for next input" string, completely ignoring the task.
- Syntax is not the bottleneck: no model produced a single syntax error. The failures are entirely in accounting reasoning — balance errors dominate for Qwen-2 (61.67%) and Llama-3 (38.33%), while Mistral mostly references accounts that do not exist in the provided balance sheet (45% unknown account errors).
- A meaningful fraction of transactions that successfully compile are semantically wrong: a recurring pattern is describing a liability decrease as "selling your debt," which increases cash, but for the wrong reason.
- GPT-4o used as an automated judge failed to flag inconsistencies in all 10 nonsensical scenarios it was shown, confirming that LLM self-evaluation is not a reliable quality gate for accounting outputs.
- Models largely copy the input/output example in the prompt rather than generalising: the 7 correct pairs closely resemble the provided example transaction structure.
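The "compiles but semantically wrong" finding is worth making concrete. Beancount's core syntactic guarantee is that a transaction's postings sum to zero, and that check is blind to which accounts the amounts move through. A minimal sketch (my own illustration, not the paper's evaluation code; account names are made up) shows both a correct booking and the "selling your debt" pattern passing the same balance check:

```python
# Sketch: why a pure balance check cannot catch the "compiles but
# semantically wrong" failure mode. Account names are hypothetical.

def balances(postings, tol=1e-9):
    """Beancount-style syntactic check: posting amounts must sum to zero."""
    return abs(sum(amount for _, amount in postings)) < tol

# Semantically correct: cash received for services is booked as income
# (income accounts carry negative amounts in Beancount's sign convention).
correct = [
    ("Assets:Cash", 1000.00),
    ("Income:Services", -1000.00),
]

# Semantically wrong: a cash increase booked against a liability change,
# with no income recognised anywhere in the transaction.
wrong = [
    ("Assets:Cash", 1000.00),
    ("Liabilities:Loans", -1000.00),
]

print(balances(correct), balances(wrong))  # True True
```

Both pass, which is exactly why the paper needs human experts on top of the compiler for semantic correctness.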
What holds up — and what doesn't
The paper's core empirical contribution is solid. The Beancount compiler is an objective, reproducible correctness criterion, and using real company balance sheets rather than toy data adds ecological validity. The hierarchical error taxonomy is thoughtfully designed — stopping evaluation at the first error avoids inflating "partial credit" for garbage outputs.
That said, there are obvious limitations the authors mostly acknowledge. Five ~7B open-weight models from 2023–2024 are a narrow slice of the capability landscape; GPT-4o and Claude were excluded for privacy reasons, which is understandable but means the headline number (2.3% correct) likely understates what frontier models can do. The financial ratio formulas were deliberately withheld from prompts to test inherent domain knowledge — a methodologically interesting choice, but one that makes the results incomparable to any system that would reasonably include formula documentation. And 300 human-evaluated samples across five models, three ratios, and five companies is modest; the per-model per-ratio cells are too small (12 samples) to draw strong conclusions about variance.
The most interesting methodological gap is the absence of any iterative or feedback-based protocol. No tool-calling, no self-correction, no compiler feedback loop — just one-shot generation. Given that CRITIC (LOG-012) and related work show that tool-interactive refinement substantially improves accuracy on tasks with verifiable outputs, a Beancount-compiler-in-the-loop experiment would have been far more informative about deployability.
Why this matters for finance AI
Every design decision for the Bean Labs write-back agent rests on assumptions about what LLMs can do with Beancount DSL. This paper is the first empirical anchor. The headline findings are sobering but also interpretable in a useful way.
First, the failure modes are specific, not random. Balance errors and unknown accounts are the two dominant problems, and both are addressable by putting the compiler in the loop: the Beancount compiler tells you exactly which account is unknown and whether the transaction balances. An agent architecture that iterates on compiler output — rather than generating once and stopping — should substantially outperform the one-shot results here. Second, syntax is free. Models have clearly learned the Beancount surface grammar; they just cannot reliably translate financial intent into correct account movements. That distinction matters for where to invest in prompting and fine-tuning. Third, the finding that GPT-4o cannot evaluate accounting quality automatically raises the bar for any automated verification system: you need the compiler, plus domain-expert spot checks, not an LLM critic.
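The iterative protocol the paper doesn't test is easy to sketch. Everything below is hypothetical: `llm_generate` stands in for a model call that receives prior diagnostics, and `compile_check` is a toy stand-in for the Beancount compiler's two dominant error classes (unknown account, imbalance). The point is the control flow, not the stubs:

```python
# Sketch of a compiler-in-the-loop agent (hypothetical; the paper evaluates
# one-shot generation only). The stubs mimic the paper's two dominant
# error classes: unknown accounts and balance errors.

KNOWN_ACCOUNTS = {"Assets:Cash", "Income:Services", "Liabilities:Loans"}

def compile_check(postings):
    """Return a list of error strings, mimicking compiler diagnostics."""
    errors = []
    for account, _ in postings:
        if account not in KNOWN_ACCOUNTS:
            errors.append(f"unknown account: {account}")
    if abs(sum(amount for _, amount in postings)) > 1e-9:
        errors.append("transaction does not balance")
    return errors

def generate_with_feedback(llm_generate, scenario, max_rounds=3):
    """Regenerate until the transaction compiles or the budget runs out."""
    feedback = []
    for _ in range(max_rounds):
        postings = llm_generate(scenario, feedback)
        feedback = compile_check(postings)
        if not feedback:
            return postings  # compiles; semantic review still needed
    return None  # out of budget: defer to a human instead of writing back

# Toy "model" that fixes its unknown account once it sees the diagnostic.
def toy_llm(scenario, feedback):
    if any("unknown account" in e for e in feedback):
        return [("Assets:Cash", 500.0), ("Income:Services", -500.0)]
    return [("Assets:Cash", 500.0), ("Income:Consulting", -500.0)]

print(generate_with_feedback(toy_llm, "client pays an invoice"))
```

Note the two exits: a compiling transaction still goes to semantic review, and exhausting the retry budget defers to a human rather than submitting — both consequences of the paper's findings.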
The paper also confirms something I suspected from the anomaly detection work (LOG-049): LLMs operating over financial transactions compile-and-submit too readily. The "Incorrect | Compiles" category — transactions that pass the syntax check but are semantically wrong — is exactly the failure mode a write-back safety guardrail must catch. A transaction can balance perfectly and still book revenue as a liability decrease, which would go undetected by any purely syntactic check.
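One way a guardrail could catch that category is a rule-based semantic check layered on top of the compiler. The rule below is my own illustration, not from the paper, and it is deliberately over-broad: it flags any cash inflow with no Income leg, which also catches legitimate borrowing — acceptable if the action is "route to human review" rather than "block". The root account names (Assets, Liabilities, Income) are Beancount's standard five roots:

```python
# Sketch of a rule-based semantic guardrail (my own rule, not the paper's):
# flag compiling transactions whose cash inflow has no Income leg, which
# catches the "selling your debt" pattern. Deliberately over-broad; flagged
# transactions go to human review, they are not rejected outright.

def flag_suspicious(postings):
    roots = {account.split(":")[0] for account, _ in postings}
    cash_in = any(acct.startswith("Assets:Cash") and amt > 0
                  for acct, amt in postings)
    return cash_in and "Income" not in roots

ok = [("Assets:Cash", 1000.0), ("Income:Services", -1000.0)]   # not flagged
bad = [("Assets:Cash", 1000.0), ("Liabilities:Loans", -1000.0)]  # flagged
print(flag_suspicious(ok), flag_suspicious(bad))  # False True
```

Both transactions balance and would compile; only the semantic rule distinguishes them.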
What to read next
- AnoLLM: Large Language Models for Tabular Anomaly Detection (OpenReview:7VkHffT5X2, ICLR 2025) — likelihood-based anomaly scoring as an alternative to the batch-detection approach; combines naturally with a Beancount compiler signal to flag structurally valid but statistically anomalous entries.
- ReDAct: Uncertainty-Aware Deferral for LLM Agents (arXiv:2604.07036) — routes low-confidence decisions to a larger model or human; directly addresses the question of when a Beancount write-back agent should defer to human review rather than proceeding after a compiler-feedback loop.
- CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing (arXiv:2305.11738, ICLR 2024) — the most relevant existing work for building a compiler-in-the-loop correction agent on top of the architecture this paper evaluates.
