<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <id>https://beancount.io/bean-labs/research-logs</id>
    <title>Beancount.io Blog</title>
    <updated>2026-07-12T00:00:00.000Z</updated>
    <generator>https://github.com/jpmonette/feed</generator>
    <link rel="alternate" href="https://beancount.io/bean-labs/research-logs"/>
    <subtitle>Beancount.io Blog</subtitle>
    <icon>https://beancount.io/img/favicon.png</icon>
    <entry>
        <title type="html"><![CDATA[FinRAGBench-V: Multimodal RAG with Visual Citations in the Financial Domain]]></title>
        <id>https://beancount.io/bean-labs/research-logs/2026/07/12/finragbench-v-multimodal-rag-visual-citation-financial-domain</id>
        <link href="https://beancount.io/bean-labs/research-logs/2026/07/12/finragbench-v-multimodal-rag-visual-citation-financial-domain"/>
        <updated>2026-07-12T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[FinRAGBench-V (EMNLP 2025) is the first large-scale benchmark for multimodal RAG with visual citations in finance, covering 112K+ document pages and 1,394 human-annotated QA pairs. Top models achieve only 20–61% block-level citation recall, and multimodal retrieval outperforms text-only by nearly 50 percentage points.]]></summary>
        <content type="html"><![CDATA[<p>Financial AI has been dominated by text-only RAG, but real financial documents are full of charts, tables, and figures that OCR cannot fully capture. FinRAGBench-V (EMNLP 2025) is the first large-scale benchmark to evaluate multimodal RAG with visual citations in the financial domain, and its results are a sobering reminder of how far production systems still have to go.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-paper">The paper<a href="https://beancount.io/bean-labs/research-logs/2026/07/12/finragbench-v-multimodal-rag-visual-citation-financial-domain#the-paper" class="hash-link" aria-label="Direct link to The paper" title="Direct link to The paper" translate="no">​</a></h2>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-beancount-io?title=FinRAGBench-V%3A%20Multimodal%20RAG%20with%20Visual%20Citations%20in%20the%20Financial%20Domain" alt="2026-07-12-finragbench-v-multimodal-rag-visual-citation-financial-domain" class="img_ev3q"></p>
<p>Zhao, Jin, Li, and Gao from Peking University introduce FinRAGBench-V, a bilingual benchmark constructed from real financial documents: research reports, financial statements, prospectuses, academic papers, magazines, and news articles. The retrieval corpus is substantial—60,780 Chinese pages and 51,219 English pages across roughly 1,100 documents per language—paired with 1,394 human-annotated QA pairs spanning seven question categories, including text inference, chart and table extraction, numerical calculation, time-sensitive queries, and multi-page reasoning. Beyond the dataset, the paper's central contribution is RGenCite, a baseline system that generates answers alongside pixel-level visual citations in the form of bounding-box coordinates marking the specific document regions that support each claim.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-ideas">Key ideas<a href="https://beancount.io/bean-labs/research-logs/2026/07/12/finragbench-v-multimodal-rag-visual-citation-financial-domain#key-ideas" class="hash-link" aria-label="Direct link to Key ideas" title="Direct link to Key ideas" translate="no">​</a></h2>
<ul>
<li class=""><strong>Multimodal retrieval dominates text-only by a crushing margin</strong>: ColQwen2, a vision-language retriever built on page-image embeddings, achieves Recall@10 of 90.13% (Chinese) and 85.86% (English). The best text-based retrievers, BM25 and BGE-M3, top out at 42.71%. This gap is not a rounding error.</li>
<li class=""><strong>Generation accuracy is low even for frontier models</strong>: GPT-4o on English reaches 43.41% accuracy (ROUGE 24.66); o4-mini on Chinese reaches 58.13% (ROUGE 38.55). These are top proprietary models with strong retrieval in place.</li>
<li class=""><strong>Page-level citation works; block-level does not</strong>: Page-level recall sits at 75–93% for the best models. Block-level recall—knowing which specific table cell or chart region grounds a claim—drops to 20–61%. This is the key gap for auditability.</li>
<li class=""><strong>Numerical reasoning and multi-page inference break models first</strong>: Questions requiring calculations across pages or temporal spans are where accuracy falls most steeply across all tested systems.</li>
<li class=""><strong>Proprietary models substantially outperform open-source alternatives</strong>: The closed-API vs. open-source gap is larger here than on most NLP benchmarks, suggesting visual financial reasoning remains unsolved for open models.</li>
<li class=""><strong>Auto-evaluation for citations is imperfect</strong>: The image-cropping citation evaluator achieves Pearson r = 0.68 with human judgments—reasonable but not reliable enough to trust fully without sampling.</li>
</ul>
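<p>The page-level versus block-level distinction above comes down to how strictly a predicted citation must overlap the gold evidence. As a rough intuition for what a block-level metric measures, here is a minimal sketch, assuming citations are axis-aligned boxes <code>(x0, y0, x1, y1)</code> and a gold block counts as cited when some predicted box overlaps it with IoU ≥ 0.5; the paper's exact matching rule may differ.</p>

```python
# Hedged sketch of a block-level citation recall metric. The IoU-based
# matching rule is an illustrative assumption, not the paper's protocol.

def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def block_citation_recall(gold_blocks, predicted_boxes, threshold=0.5):
    """Fraction of gold evidence blocks matched by at least one citation."""
    if not gold_blocks:
        return 1.0
    hit = sum(
        any(iou(g, p) >= threshold for p in predicted_boxes)
        for g in gold_blocks
    )
    return hit / len(gold_blocks)
```

<p>Under a metric like this, 20–61% recall means that for roughly half of all claims, no predicted bounding box lands on the table cell or chart region that actually grounds the answer.</p>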
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-holds-up--and-what-doesnt">What holds up — and what doesn't<a href="https://beancount.io/bean-labs/research-logs/2026/07/12/finragbench-v-multimodal-rag-visual-citation-financial-domain#what-holds-up--and-what-doesnt" class="hash-link" aria-label="Direct link to What holds up — and what doesn't" title="Direct link to What holds up — and what doesn't" translate="no">​</a></h2>
<p>The retrieval finding is the most credible result in the paper. A gap of nearly 50 percentage points between multimodal and text-only retrievers at 60k+ pages is too large to dismiss. When you OCR a financial document before indexing, you destroy structural layout signals—which column a number appears in, whether a figure caption modifies a table's interpretation—that turn out to matter enormously for retrieval.</p>
<p>The generation numbers are honest but hard to interpret in isolation. The authors do not ablate how much of the accuracy gap is attributable to retrieval errors versus generation failures. Given that Recall@10 is already 85.86% for English, a meaningful fraction of failures must be generation-side rather than retrieval-side. Knowing that breakdown would clarify whether the bottleneck is multimodal reasoning or something more fundamental about how MLLMs handle financial language.</p>
<p>The evaluation set of 1,394 QA pairs is small for the scope of the benchmark. Split across seven categories and two languages, some slices have well under 200 examples. The statistical significance of category-level findings is left implicit. This is not unusual for a benchmark paper, but it does mean cherry-picked comparisons would be easy to construct.</p>
<p>The citation evaluation protocol is an interesting contribution, but Pearson r = 0.68 with human ratings is not strong enough to treat auto-evaluation as ground truth for block-level grounding. The authors acknowledge this; future work on better citation metrics is explicitly flagged.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-this-matters-for-finance-ai">Why this matters for finance AI<a href="https://beancount.io/bean-labs/research-logs/2026/07/12/finragbench-v-multimodal-rag-visual-citation-financial-domain#why-this-matters-for-finance-ai" class="hash-link" aria-label="Direct link to Why this matters for finance AI" title="Direct link to Why this matters for finance AI" translate="no">​</a></h2>
<p>Beancount operates over plain-text ledger files, which makes text-only RAG defensible for querying past transactions. But the broader accounting task involves documents that are emphatically not plain text: bank statement PDFs, scanned invoices, receipt images, annual reports with embedded tables and charts. The moment a Beancount agent needs to reconcile a ledger entry against a source document—verify that a particular charge matches the invoice on file—it is doing exactly the task FinRAGBench-V benchmarks.</p>
<p>The block-level citation finding matters most for this use case. If an agent must justify a ledger entry by pointing to a specific line item in a PDF, and the best available system achieves only 20–61% block-level recall, that is not audit-ready. Any Beancount pipeline that touches scanned source documents needs human-in-the-loop review until this number improves substantially.</p>
<p>The retrieval modality gap also argues strongly against pure-text pipelines for document ingestion. A receipt image carries layout information—amount fields, vendor names, line-item positions—that OCR destroys. That layout information is precisely what distinguishes a line total from a tax amount, and FinRAGBench-V shows that multimodal retrievers exploit it in ways text retrievers cannot.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-to-read-next">What to read next<a href="https://beancount.io/bean-labs/research-logs/2026/07/12/finragbench-v-multimodal-rag-visual-citation-financial-domain#what-to-read-next" class="hash-link" aria-label="Direct link to What to read next" title="Direct link to What to read next" translate="no">​</a></h2>
<ul>
<li class=""><strong>ColPali: Efficient Document Retrieval with Vision Language Models</strong> — the predecessor to ColQwen2 that established the visual page-embedding approach FinRAGBench-V's best retriever is built on [arXiv:2407.01449, ECCV 2024]</li>
<li class=""><strong>M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding</strong> — tackles multi-document visual QA with a flexible framework that handles single and multi-hop visual reasoning across pages [arXiv:2411.04952]</li>
<li class=""><strong>Benchmarking Temporal-Aware Multi-Modal RAG in Finance</strong> — a companion benchmark from 2025 evaluating time-sensitivity in financial multimodal RAG, directly complementary to FinRAGBench-V's time-sensitive question category [arXiv:2503.05185]</li>
</ul>]]></content>
        <author>
            <name>Mike Thrift</name>
        </author>
        <category label="AI" term="AI"/>
        <category label="LLM" term="LLM"/>
        <category label="Machine Learning" term="Machine Learning"/>
        <category label="Finance" term="Finance"/>
        <category label="Financial Reporting" term="Financial Reporting"/>
        <category label="Data Science" term="Data Science"/>
        <category label="Reconciliation" term="Reconciliation"/>
        <category label="Beancount" term="Beancount"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Can LLM Agents Be CFOs? EnterpriseArena's 132-Month Simulation Reveals a Wide Gap]]></title>
        <id>https://beancount.io/bean-labs/research-logs/2026/07/11/can-llm-agents-be-cfos-enterprisearena-resource-allocation-benchmark</id>
        <link href="https://beancount.io/bean-labs/research-logs/2026/07/11/can-llm-agents-be-cfos-enterprisearena-resource-allocation-benchmark"/>
        <updated>2026-07-11T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[EnterpriseArena runs 11 LLMs through a 132-month CFO simulation tracking survival, terminal valuation, and book-closing rates. Only Qwen3.5-9B survives 80% of runs; GPT-5.4 and DeepSeek-V3.1 hit 0%. Human experts achieve 100% survival at 5× the terminal value. The critical bottleneck: LLMs skip ledger reconciliation 80% of the time, acting on stale financial state.]]></summary>
        <content type="html"><![CDATA[<p>The most ambitious question in finance AI right now is not "can an LLM answer a question about a balance sheet?" but "can an LLM manage a company's money over time without running out of it?" Yi Han et al.'s <em>Can LLM Agents Be CFOs?</em> (arXiv:2603.23638) builds EnterpriseArena to test exactly that, and the answer is: barely, and not in the ways you'd expect.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-paper">The paper<a href="https://beancount.io/bean-labs/research-logs/2026/07/11/can-llm-agents-be-cfos-enterprisearena-resource-allocation-benchmark#the-paper" class="hash-link" aria-label="Direct link to The paper" title="Direct link to The paper" translate="no">​</a></h2>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-beancount-io?title=Can%20LLM%20Agents%20Be%20CFOs%3F%20EnterpriseArena%27s%20132-Month%20Simulation%20Reveals%20a%20Wide%20Gap" alt="2026-07-11-can-llm-agents-be-cfos-enterprisearena-resource-allocation-benchmark" class="img_ev3q"></p>
<p>EnterpriseArena is a 132-month (11-year) simulation of CFO-level resource allocation. Each timestep represents one month. The agent receives partial observations of firm-level financials, anonymized business documents, and macroeconomic signals drawn from FRED, CBOE, and S&amp;P Global data. It has a budget of 20 tool calls per month spread across four operations — verifying cash position, reviewing financial records, analyzing market conditions, and projecting cash flows — and must choose one of three actions: close the books (reconciliation), request funding (equity or debt, with stochastic outcomes), or pass. The primary constraint is that the company's cash balance must stay non-negative at every timestep; violation ends the episode with a score of zero. Subject to survival, the agent maximizes terminal enterprise valuation under the scoring formula Rev_T × 5 + Cash_T − 5,000 × N_tools, which explicitly penalizes excessive tool use.</p>
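<p>The episode objective above can be sketched in a few lines: survival is a hard constraint checked at every timestep, and only surviving runs earn the terminal valuation score. This is a minimal sketch under the scoring formula as described; variable names are mine and the simulator's internals are not specified in this detail.</p>

```python
# Hedged sketch of EnterpriseArena's episode scoring: a negative cash
# balance at any month zeroes the episode; otherwise the score is
# Rev_T * 5 + Cash_T - 5,000 * N_tools.

def episode_score(monthly_cash, terminal_revenue, total_tool_calls):
    """Return 0 on a cash violation, else the terminal valuation score."""
    if any(cash < 0 for cash in monthly_cash):  # bankruptcy at any timestep
        return 0.0
    terminal_cash = monthly_cash[-1]
    return terminal_revenue * 5 + terminal_cash - 5_000 * total_tool_calls
```

<p>Note how the <code>5_000 * total_tool_calls</code> term makes compulsive checking directly costly, which is the design choice the key ideas below return to.</p>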
<p>Eleven LLMs were evaluated, including Gemini-3.1-Pro, Claude-Haiku-4.5, GPT-5.4, DeepSeek-V3.1, Llama-3.3-70B, Qwen3.5-397B, and Qwen3.5-9B, alongside a human expert baseline validated by two finance professionals with 8 and 14 years of experience respectively.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-ideas">Key ideas<a href="https://beancount.io/bean-labs/research-logs/2026/07/11/can-llm-agents-be-cfos-enterprisearena-resource-allocation-benchmark#key-ideas" class="hash-link" aria-label="Direct link to Key ideas" title="Direct link to Key ideas" translate="no">​</a></h2>
<ul>
<li class=""><strong>Survival rates vary wildly across models</strong>: Qwen3.5-9B survives 80% of runs, Gemini-3.1-Pro 50%, Claude-Haiku-4.5 and GLM-5 each 20%, and GPT-5.4, DeepSeek-V3.1, Llama-3.3-70B, Mistral-Small-24B, and Mixtral-8x7B each 0%. The overall LLM average is 26%.</li>
<li class=""><strong>Larger models do not reliably outperform smaller ones</strong>: Qwen3.5-9B (9B parameters, 80% survival, $78.8M terminal valuation) decisively beats Qwen3.5-397B (397B parameters, 20% survival) and GPT-5.4 (0% survival).</li>
<li class=""><strong>The gap from humans is large</strong>: the human baseline achieves 100% survival and $152.2M ± $29.6M terminal valuation; the LLM average is $28.2M with 26% survival.</li>
<li class=""><strong>Book-closing is the critical bottleneck</strong>: human experts close the books (reconcile) at 94.3% of timesteps; LLMs average 19.3%. This is the action that produces ground-truth financial statements and enables rational subsequent decisions.</li>
<li class=""><strong>Information gathering without action is lethal</strong>: Qwen3.5-397B uses market-analysis and forecasting tools at a high rate throughout the simulation but almost never closes books (0.0% book-closing rate) and almost never requests funding, dying from cash exhaustion despite "knowing" what was happening.</li>
<li class=""><strong>The tool-budget penalty matters</strong>: the scoring formula actively punishes agents that compulsively check rather than act, a constraint that mirrors real opportunity cost.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-holds-up--and-what-doesnt">What holds up — and what doesn't<a href="https://beancount.io/bean-labs/research-logs/2026/07/11/can-llm-agents-be-cfos-enterprisearena-resource-allocation-benchmark#what-holds-up--and-what-doesnt" class="hash-link" aria-label="Direct link to What holds up — and what doesn't" title="Direct link to What holds up — and what doesn't" translate="no">​</a></h2>
<p>The dual-objective design — survival as a hard constraint plus terminal valuation — is one of the strongest choices in recent agent benchmarking. It reflects how real CFOs actually operate: you cannot optimize growth if you're out of money. The anonymization of calendar dates and company identities prevents models from pattern-matching on memorized historical outcomes, which is a genuine methodological improvement over finance benchmarks that use real tickers and real dates.</p>
<p>The failure-mode taxonomy the authors identify through case studies is credible: GPT-5.4 posts a 99.1% pass rate, selecting the do-nothing pass action at almost every timestep, while Qwen3.5-397B mistakes analysis for action. These are behaviorally distinct failure modes with different remedies.</p>
<p>What I'm less convinced by: the stochastic macro environment uses Gaussian noise to approximate market shocks, which the authors themselves acknowledge cannot replicate black-swan events or human irrationality. The tool budget of 20 calls per month is also somewhat arbitrary — real CFOs don't face this kind of query-rate constraint on their own memory, which raises the question of whether the benchmark is measuring long-horizon financial judgment or something closer to RAG-under-resource-pressure. The single-agent structure is another explicit limitation the authors name: real CFOs operate within hierarchies of controllers, FP&amp;A analysts, and treasury teams, and the paper does not attempt to simulate this.</p>
<p>The finding that model size doesn't predict survival is striking and probably genuine, but the mechanism isn't well explained. The authors note it without fully unpacking whether it's a failure of instruction-following, long-context coherence, or risk calibration.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-this-matters-for-finance-ai">Why this matters for finance AI<a href="https://beancount.io/bean-labs/research-logs/2026/07/11/can-llm-agents-be-cfos-enterprisearena-resource-allocation-benchmark#why-this-matters-for-finance-ai" class="hash-link" aria-label="Direct link to Why this matters for finance AI" title="Direct link to Why this matters for finance AI" translate="no">​</a></h2>
<p>The book-closing action in EnterpriseArena is essentially the Beancount <code>balance</code> assertion and ledger reconciliation step — the moment when the agent commits to a ground-truth view of financial state before acting. The finding that LLMs skip this 80% of the time maps directly onto the write-back safety problem: an agent that avoids reconciliation before acting is an agent that acts on stale or hallucinated state. For Beancount automation, this suggests that the reconciliation step should be mandatory and verifiable — not optional — in any agent loop.</p>
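<p>One way to make reconciliation mandatory rather than optional is a guard that refuses write-back actions unless ledger state was verified recently. This is a minimal sketch of that shape, assuming hypothetical <code>verify_balances</code> and <code>apply_entry</code> hooks; neither is part of Beancount's actual API.</p>

```python
# Hedged sketch of a mandatory-reconciliation guard for an agent loop:
# writes are rejected unless a balance-verification step ran within the
# last `max_staleness` actions. Hook names are illustrative assumptions.

class ReconciledLedger:
    def __init__(self, max_staleness=1):
        self.max_staleness = max_staleness  # writes allowed per verification
        self.actions_since_check = None     # None until first verification

    def close_books(self, verify_balances):
        """Run balance assertions; only then permit writes."""
        verify_balances()
        self.actions_since_check = 0

    def write(self, apply_entry, entry):
        if (self.actions_since_check is None
                or self.actions_since_check >= self.max_staleness):
            raise RuntimeError("reconcile before writing: state may be stale")
        apply_entry(entry)
        self.actions_since_check += 1
```

<p>The point is structural: an agent wired through a guard like this physically cannot reproduce the 80% skip rate, because acting on stale state raises instead of silently posting.</p>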
<p>The 132-month horizon is also directly analogous to multi-year ledger management. The finding that sustained situational awareness degrades over time is the same degradation we'd expect in a Beancount agent managing five years of transaction history: even if the agent has all the data in context, it may not act on it coherently at month 60. This suggests that periodic forced reconciliation checkpoints — not just reactive querying — are necessary in long-running Beancount agent sessions.</p>
<p>The information-gathering trap Qwen3.5-397B falls into is a useful design warning: agents equipped with many retrieval tools may prefer retrieval to commitment, especially when the cost of a wrong action (ledger corruption) is high. Tool-budget constraints of the kind EnterpriseArena uses could help enforce action discipline in Beancount write-back agents.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-to-read-next">What to read next<a href="https://beancount.io/bean-labs/research-logs/2026/07/11/can-llm-agents-be-cfos-enterprisearena-resource-allocation-benchmark#what-to-read-next" class="hash-link" aria-label="Direct link to What to read next" title="Direct link to What to read next" translate="no">​</a></h2>
<ul>
<li class=""><strong>EcoGym</strong> (arXiv:2602.09514) — complementary long-horizon economy benchmark across Vending, Freelance, and Operation environments over 1,000+ steps; no model dominates across all three, suggesting the failure modes in EnterpriseArena are not idiosyncratic to one benchmark design.</li>
<li class=""><strong>AFlow: Automating Agentic Workflow Generation</strong> (arXiv:2410.10762, ICLR 2025 oral) — reformulates workflow design as code-space search with MCTS and LLM feedback; if EnterpriseArena shows that manually designed agent behaviors fail, AFlow is the obvious next step for discovering better pipelines automatically.</li>
<li class=""><strong>ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs</strong> (arXiv:2307.16789, ICLR 2024) — the foundational tool-use training and evaluation framework; understanding how tool-calling behavior is learned in ToolLLM clarifies whether the action-avoidance failure in EnterpriseArena is a training problem or a prompting problem.</li>
</ul>]]></content>
        <author>
            <name>Mike Thrift</name>
        </author>
        <category label="AI" term="AI"/>
        <category label="LLM" term="LLM"/>
        <category label="Automation" term="Automation"/>
        <category label="Reconciliation" term="Reconciliation"/>
        <category label="Beancount" term="Beancount"/>
        <category label="Cash Flow" term="Cash Flow"/>
        <category label="Financial Management" term="Financial Management"/>
        <category label="Forecasting" term="Forecasting"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[WildToolBench: Why No LLM Exceeds 15% Session Accuracy in Real-World Tool Use]]></title>
        <id>https://beancount.io/bean-labs/research-logs/2026/07/10/wildtoolbench-benchmarking-llm-tool-use-in-the-wild</id>
        <link href="https://beancount.io/bean-labs/research-logs/2026/07/10/wildtoolbench-benchmarking-llm-tool-use-in-the-wild"/>
        <updated>2026-07-10T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[WildToolBench (ICLR 2026) evaluates 57 LLMs on 1,024 tasks drawn from real user behavior — no model exceeds 15% session accuracy, with compositional orchestration, hidden intent, and instruction transitions as the three sharpest failure modes.]]></summary>
        <content type="html"><![CDATA[<p>The tool-use benchmarks I've been tracking — BFCL, ToolBench, τ-bench — all share a common design flaw: they construct tasks from the benchmark authors' imagination of what users do. WildToolBench, accepted at ICLR 2026, goes back to real user logs and asks what users <em>actually</em> do. The answer is humbling: 57 LLMs evaluated, zero exceed 15% session accuracy.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-paper">The paper<a href="https://beancount.io/bean-labs/research-logs/2026/07/10/wildtoolbench-benchmarking-llm-tool-use-in-the-wild#the-paper" class="hash-link" aria-label="Direct link to The paper" title="Direct link to The paper" translate="no">​</a></h2>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-beancount-io?title=WildToolBench%3A%20Why%20No%20LLM%20Exceeds%2015%25%20Session%20Accuracy%20in%20Real-World%20Tool%20Use" alt="2026-07-10-wildtoolbench-benchmarking-llm-tool-use-in-the-wild" class="img_ev3q"></p>
<p>Peijie Yu, Wei Liu, Yifan Yang, and colleagues at Alibaba present WildToolBench (arXiv:2604.06185), a benchmark of 256 multi-turn dialogue scenarios with 1,024 tasks drawn from authentic user behavior patterns and grounded in ~1,600 public APIs. The core argument is that existing benchmarks are saturating not because the models are good, but because the tasks are artificial. Real users bundle requests together, omit context they shared two turns ago, and switch between asking a tool question, making small talk, and requesting a clarification — sometimes within a single message. WildToolBench operationalizes these failure modes into three structured challenge categories and measures both task-level accuracy and the much stricter session-level accuracy, which requires succeeding at all four tasks in a dialogue.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-ideas">Key ideas<a href="https://beancount.io/bean-labs/research-logs/2026/07/10/wildtoolbench-benchmarking-llm-tool-use-in-the-wild#key-ideas" class="hash-link" aria-label="Direct link to Key ideas" title="Direct link to Key ideas" translate="no">​</a></h2>
<ul>
<li class=""><strong>Session accuracy collapses to single digits for most models</strong>: Gemini-2.0-Flash-Thinking leads at 14.45% session accuracy, Claude-4-Sonnet at 12.50%, GPT-4o at 11.72%. Passing all tasks in a four-turn session is hard enough that even 60% task accuracy translates to under 15% session accuracy — a compound-probability tax on every interaction.</li>
<li class=""><strong>Compositional orchestration is the sharpest cliff</strong>: Mixed sequential-plus-parallel tool topologies cap top models at 25% task accuracy, versus 54–62% for purely parallel or sequential chains. When a task requires a parallel fan-out followed by a sequential merge, the coordination problem exceeds what any current model handles reliably.</li>
<li class=""><strong>Hidden intent is a bigger gap than anyone measured before</strong>: WildToolBench ensures 100% of tasks involve implicit or cross-turn information; BFCL v3 manages only 15.7%. Long-range dependency tasks — where the missing information is more than two turns back — are the hardest sub-type, with no model breaking 50% even at the task level.</li>
<li class=""><strong>Instruction transitions compound errors at a linear rate</strong>: Each additional policy switch (tool task → chat → clarification → tool task) drops accuracy by roughly 5–15 percentage points. At three transitions, the worst-affected models lose 30 points. The authors call this "self-conditioning": prior responses bias the model's interpretation of subsequent instructions in ways that are difficult to correct mid-session.</li>
<li class=""><strong>Optimal Path Rate stays below 43%</strong>: Even when models complete tasks correctly, they burn excess API calls. Claude-4-Sonnet achieves the best Optimal Path Rate at 42.74%, meaning the majority of correct completions take more steps than necessary — a direct cost in latency and tokens for any production system.</li>
<li class=""><strong>Specialized tool-use models underperform general frontier models</strong>: xLAM-2-70B and ToolACE2-8B both post wrong-function-name error rates exceeding 30%, worse than GPT-4o or Claude-4-Sonnet. Fine-tuning on narrow tool-use corpora appears to create brittleness rather than robustness under distribution shift to wild user behavior.</li>
</ul>
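<p>The compound-probability tax in the first bullet is worth seeing as arithmetic. If the four tasks in a session were independent, per-task accuracy <code>p</code> would give session accuracy <code>p⁴</code>; real task errors correlate within a session, so this is a first-order intuition rather than the benchmark's measured relationship.</p>

```python
# The compound-probability tax in two lines: independent per-task
# accuracy p over k tasks gives session accuracy p**k.

def session_accuracy(task_accuracy, tasks_per_session=4):
    return task_accuracy ** tasks_per_session
```

<p>At 60% task accuracy this yields roughly 13% session accuracy, which is why even respectable task-level numbers land under the 15% session ceiling.</p>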
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-holds-up--and-what-doesnt">What holds up — and what doesn't<a href="https://beancount.io/bean-labs/research-logs/2026/07/10/wildtoolbench-benchmarking-llm-tool-use-in-the-wild#what-holds-up--and-what-doesnt" class="hash-link" aria-label="Direct link to What holds up — and what doesn't" title="Direct link to What holds up — and what doesn't" translate="no">​</a></h2>
<p>The benchmark design is strong where it matters most. The distinction between task accuracy and session accuracy is exactly right: compounding failure modes is what kills real deployments, and most prior work reports task-level numbers that mask this. The three-challenge taxonomy (compositional orchestration, hidden intent, instruction transitions) is well-motivated and empirically substantiated — the performance degradation curves across challenge types are real and striking.</p>
<p>The weak spot is scale. 1,024 tasks from 256 scenarios is a credible research artifact but thin for a leaderboard intended to track 57 models over time. The authors acknowledge this directly and mention an automated scaling pipeline in future work. The other issue is that "grounded in real user logs" is doing a lot of work: the final tasks are partially synthetic, constructed by a multi-agent system from seed patterns, then verified by human annotators. The claim is grounded but the data is not verbatim wild — it is wild-inspired. That matters for how literally you interpret the 15% ceiling; some fraction of the gap might close if the generation pipeline introduces artificial difficulty that real users don't actually exhibit.</p>
<p>I'm also skeptical of the instruction-transition analysis as an architectural claim. The paper attributes it to a fundamental limitation, but a training-distribution mismatch between RLHF fine-tuning objectives and mixed-intent multi-turn user sessions is the more parsimonious explanation. That is addressable, not structural.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-this-matters-for-finance-ai">Why this matters for finance AI<a href="https://beancount.io/bean-labs/research-logs/2026/07/10/wildtoolbench-benchmarking-llm-tool-use-in-the-wild#why-this-matters-for-finance-ai" class="hash-link" aria-label="Direct link to Why this matters for finance AI" title="Direct link to Why this matters for finance AI" translate="no">​</a></h2>
<p>The three failure modes map almost perfectly onto how real users interact with a Beancount write-back agent. A user asks "how much did I spend on groceries last month, and while you're at it add today's Whole Foods receipt" — that is a compositional task bundled into one turn. They follow it with "actually make it $47.23 not $42, I looked it up" — that is a parameter correction requiring the agent to track session state. Then they ask "is that category right?" — that is a clarification request, and the agent needs to not re-execute the write operation it just finished. The 25% cap on mixed sequential-plus-parallel orchestration and the 30-point drop from instruction transitions are exactly the failure modes that would manifest in a ledger agent fielding real user sessions.</p>
<p>The finding that specialized tool-use models underperform general frontier models is particularly relevant. If we were considering fine-tuning a smaller open model on Beancount-specific tool-calling examples — the obvious cost-reduction play — WildToolBench is a direct warning that specialization may sacrifice robustness to the distribution of actual user behavior. The Optimal Path Rate finding matters too: an agent that uses twice as many API calls to complete a task is not just inefficient; for write-back operations, redundant intermediate calls can leave the ledger in inconsistent intermediate states.</p>
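<p>The redundant-call risk for write-back operations has a standard mitigation: derive an idempotency key from the entry content, so a retried or duplicated tool call cannot post the same entry twice. This is a minimal sketch of that pattern; the key scheme and the <code>post_entry</code> hook are illustrative assumptions, not an existing Beancount interface.</p>

```python
# Hedged sketch of idempotent write-back: content-hash keys make
# duplicate tool calls a no-op instead of a double posting.
import hashlib

class IdempotentWriter:
    def __init__(self, post_entry):
        self.post_entry = post_entry
        self.seen = set()

    def write(self, entry_text):
        key = hashlib.sha256(entry_text.encode()).hexdigest()
        if key in self.seen:  # duplicate call: swallow it, don't re-post
            return False
        self.seen.add(key)
        self.post_entry(entry_text)
        return True
```

<p>A guard like this does not fix a low Optimal Path Rate, but it converts wasted steps from a correctness hazard into a mere latency cost.</p>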
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-to-read-next">What to read next<a href="https://beancount.io/bean-labs/research-logs/2026/07/10/wildtoolbench-benchmarking-llm-tool-use-in-the-wild#what-to-read-next" class="hash-link" aria-label="Direct link to What to read next" title="Direct link to What to read next" translate="no">​</a></h2>
<ul>
<li class=""><strong>ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs</strong> (arXiv:2307.16789, ICLR 2024) — the foundational training framework WildToolBench explicitly positions against; understanding its synthetic evaluation design clarifies exactly what live execution adds.</li>
<li class=""><strong>τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains</strong> (arXiv:2406.12045) — the closest prior work on realistic multi-turn tool use; comparing τ-bench's retail/airline domains against WildToolBench's public API coverage shows how much the challenge generalizes.</li>
<li class=""><strong>AFlow: Automating Agentic Workflow Generation</strong> (arXiv:2410.10762, ICLR 2025 oral) — if the instruction-transition problem is addressable by automatically discovering better agent workflows rather than scaling training data, AFlow is the most credible mechanism for doing so.</li>
</ul>]]></content>
        <author>
            <name>Mike Thrift</name>
        </author>
        <category label="AI" term="AI"/>
        <category label="LLM" term="LLM"/>
        <category label="Automation" term="Automation"/>
        <category label="Machine Learning" term="Machine Learning"/>
        <category label="Beancount" term="Beancount"/>
        <category label="Data Science" term="Data Science"/>
        <category label="Technology" term="Technology"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[LLM Confidence and Calibration: A Survey of What the Research Actually Shows]]></title>
        <id>https://beancount.io/bean-labs/research-logs/2026/07/09/confidence-estimation-calibration-llms-survey</id>
        <link href="https://beancount.io/bean-labs/research-logs/2026/07/09/confidence-estimation-calibration-llms-survey"/>
        <updated>2026-07-09T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[A systematic survey of LLM confidence estimation and calibration methods—white-box logit approaches, consistency-based SelfCheckGPT, and semantic entropy—reveals that verbalized confidence scores from GPT-4 achieve only ~62.7% AUROC, barely above chance, with direct implications for deploying uncertainty-aware agents in finance and accounting.]]></summary>
        <content type="html"><![CDATA[<p>Last week I covered ReDAct, which routes agent decisions to an expensive fallback model when a cheap model's uncertainty exceeds a calibrated threshold. That paper does a lot of hand-waving about "uncertainty" — it's worth pausing to understand what the field actually knows about measuring and calibrating it. Geng et al.'s "A Survey of Confidence Estimation and Calibration in Large Language Models" (NAACL 2024) is the right place to start: a systematic taxonomy of what works, what doesn't, and what nobody has measured yet.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-paper">The paper<a href="https://beancount.io/bean-labs/research-logs/2026/07/09/confidence-estimation-calibration-llms-survey#the-paper" class="hash-link" aria-label="Direct link to The paper" title="Direct link to The paper" translate="no">​</a></h2>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-beancount-io?title=LLM%20Confidence%20and%20Calibration%3A%20A%20Survey%20of%20What%20the%20Research%20Actually%20Shows" alt="2026-07-09-confidence-estimation-calibration-llms-survey" class="img_ev3q"></p>
<p>Geng, Cai, Wang, Koeppl, Nakov, and Gurevych survey the emerging literature on LLM confidence estimation and calibration across tasks ranging from multiple-choice QA to open-ended generation and machine translation. The core problem: LLMs can be both highly accurate and completely unreliable in ways that are hard to distinguish from the outside. The survey organizes the solution space into two main branches — white-box methods that exploit access to internal model states, and black-box methods that treat the model as opaque — and within each, further distinguishes between estimating confidence and calibrating it post hoc.</p>
<p>The paper was published at NAACL 2024 (pages 6577–6595), revised in March 2024 from a November 2023 submission by a team spanning TU Darmstadt and MBZUAI (the Mohamed bin Zayed University of Artificial Intelligence).</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-ideas">Key ideas<a href="https://beancount.io/bean-labs/research-logs/2026/07/09/confidence-estimation-calibration-llms-survey#key-ideas" class="hash-link" aria-label="Direct link to Key ideas" title="Direct link to Key ideas" translate="no">​</a></h2>
<ul>
<li class="">
<p><strong>White-box confidence via logits</strong>: The simplest approach uses token-level probabilities or length-normalized log-likelihood as a confidence signal. These methods work but face a fundamental ambiguity: low token probability can reflect low factual confidence or simply unusual phrasing — the model may be uncertain about word choice while being certain about the underlying fact.</p>
</li>
<li class="">
<p><strong>Consistency-based black-box confidence (SelfCheckGPT)</strong>: Manakul et al. (EMNLP 2023) sample multiple completions and score their mutual consistency using BERTScore, NLI, or n-gram overlap. No logit access needed. The key insight: for facts the LLM knows well, repeated samples converge; for hallucinated facts, they diverge.</p>
</li>
<li class="">
<p><strong>Semantic entropy</strong>: Farquhar et al. (<em>Nature</em>, 2024) cluster semantically equivalent answers before computing entropy. An LLM might phrase "Paris" and "the French capital" differently — raw token entropy treats these as divergent; semantic entropy does not. This is a qualitative step forward over token-level consistency that the survey contextualizes.</p>
</li>
<li class="">
<p><strong>Verbalized confidence is broken</strong>: When asked to output a confidence percentage, models collapse into overconfidence. Empirical work (Groot et al., TrustNLP at ACL 2024) finds that GPT-3, GPT-3.5, and Vicuna all show average Expected Calibration Error (ECE) exceeding 0.377 for verbalized confidence, with predictions clustering in the 90–100% range regardless of actual accuracy. Even GPT-4 — the best-calibrated model evaluated — achieves an AUROC of only ~62.7% when using verbalized confidence to discriminate correct from incorrect answers, barely above chance.</p>
</li>
<li class="">
<p><strong>Calibration techniques vary by task</strong>: For classification, contextual calibration (subtracting class-prior bias estimated with an empty "[N/A]" prompt) and position debiasing (PriDE) address known systematic biases. For generation, Sequence Likelihood Calibration (SLiC) fine-tunes models on ranked completions. Temperature scaling — the simplest post-hoc fix — remains competitive in many settings.</p>
</li>
<li class="">
<p><strong>No unified benchmark exists</strong>: The survey's most damning structural observation: there is no single benchmark spanning confidence estimation methods across tasks and domains. This makes it nearly impossible to rigorously compare methods. The field is evaluating apples against oranges.</p>
</li>
</ul>
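<p>The ECE figures quoted above come from the standard binned estimator: bucket predictions by stated confidence, then take the frequency-weighted gap between accuracy and mean confidence in each bucket. A minimal stdlib-only sketch of that computation:</p>

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: weighted mean |accuracy - confidence| per bin.

    confidences: predicted probabilities in [0, 1]
    correct:     1/0 indicators of whether each prediction was right
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, ok))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(o for _, o in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(accuracy - avg_conf)
    return ece

# An overconfident model: says 0.95 on everything, right half the time.
confs = [0.95] * 10
right = [1, 0] * 5
print(round(expected_calibration_error(confs, right), 3))  # 0.45
```

<p>The toy case mirrors the verbalized-confidence pathology: predictions cluster in the top bin regardless of accuracy, so the gap in that single bin dominates the score.</p>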
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-holds-up--and-what-doesnt">What holds up — and what doesn't<a href="https://beancount.io/bean-labs/research-logs/2026/07/09/confidence-estimation-calibration-llms-survey#what-holds-up--and-what-doesnt" class="hash-link" aria-label="Direct link to What holds up — and what doesn't" title="Direct link to What holds up — and what doesn't" translate="no">​</a></h2>
<p>The taxonomy is solid. The white-box vs. black-box distinction is genuinely useful for system design, and the treatment of logit-based methods is honest about their limits — the authors note directly that token probability conflates factual confidence with lexical uncertainty. Practitioners underestimate this conflation.</p>
<p>Where the survey frustrates me: it is largely descriptive. There are almost no experimental benchmarks comparing methods head-to-head, and the authors acknowledge this explicitly as a limitation. I come away with a clear design-space map but no guidance on which method to use for a new task.</p>
<p>The verbalized-confidence results — GPT-4's AUROC ~62.7% on its own stated confidence — should be canonical knowledge for anyone deploying LLMs in production. It isn't. People still ship prompts that ask "on a scale of 1–10, how confident are you?" and treat the answer as meaningful. It isn't.</p>
<p>The survey is also thin on the RLHF calibration question: does post-training with human feedback make models better or worse calibrated? There is evidence both ways, and the survey largely sidesteps it.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-this-matters-for-finance-ai">Why this matters for finance AI<a href="https://beancount.io/bean-labs/research-logs/2026/07/09/confidence-estimation-calibration-llms-survey#why-this-matters-for-finance-ai" class="hash-link" aria-label="Direct link to Why this matters for finance AI" title="Direct link to Why this matters for finance AI" translate="no">​</a></h2>
<p>ReDAct stakes its safety story on having a calibrated uncertainty signal from the cheap model. The survey makes clear how hard that actually is. Logit-based signals are available in white-box settings but conflate lexical and factual uncertainty. Consistency-based methods work in black-box settings but require multiple samples per decision — expensive for a high-throughput Beancount write-back agent processing a batch of transaction entries.</p>
<p>The most actionable finding for Bean Labs: semantic entropy clusters semantically equivalent answers before scoring consistency, which is precisely what matters for ledger entries where a model might express the same debit/credit relationship in multiple syntactically distinct forms. A Beancount agent should use semantic clustering over sampled ledger-entry completions — not raw token-level variance — to detect when it is hallucinating an account name or amount.</p>
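<p>As a sketch of what that could look like, assume each sampled ledger completion reduces to an (account, amount) pair and treat samples that agree on the pair as one semantic cluster. The normalization function below is a stand-in of my own; Farquhar et al. cluster with bidirectional entailment, which is far stronger:</p>

```python
import math
from collections import Counter
from decimal import Decimal

def normalize(entry):
    """Map one sampled ledger completion to a semantic key.

    Hypothetical normalization: two samples that post the same amount
    to the same account count as semantically equivalent even if the
    surface text differs. Decimal makes "42.0" and "42.00" compare equal.
    """
    return (entry["account"].lower(), Decimal(entry["amount"]))

def semantic_entropy(samples):
    """Shannon entropy (bits) over semantic clusters of the samples."""
    counts = Counter(normalize(s) for s in samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Five samples that agree semantically -> zero entropy, accept.
agree = [{"account": "Expenses:Groceries", "amount": "42.00"}] * 5
assert semantic_entropy(agree) == 0.0

# Samples that disagree on the account -> high entropy, flag for review.
disagree = agree[:3] + [{"account": "Expenses:Dining", "amount": "42.00"}] * 2
assert semantic_entropy(disagree) > 0.9
```

<p>The design choice worth noting: entropy is computed over clusters, not raw strings, so harmless surface variation (whitespace, trailing zeros) never inflates the uncertainty signal.</p>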
<p>The calibration failure of verbalized confidence is a direct warning for any UI that surfaces "how confident is the AI?" to the user: do not trust the number the model produces. Use an external calibrator or consistency-based method instead, or don't surface it at all.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-to-read-next">What to read next<a href="https://beancount.io/bean-labs/research-logs/2026/07/09/confidence-estimation-calibration-llms-survey#what-to-read-next" class="hash-link" aria-label="Direct link to What to read next" title="Direct link to What to read next" translate="no">​</a></h2>
<ul>
<li class="">Farquhar et al., "Detecting hallucinations in large language models using semantic entropy," <em>Nature</em>, 2024 — the most rigorous method that comes out of this survey framework; worth reading in full rather than through the survey's summary.</li>
<li class="">Manakul et al., "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models," EMNLP 2023 (arXiv:2303.08896) — the canonical consistency-based method; essential to understand before deploying any black-box confidence signal.</li>
<li class="">Groot et al., "Overconfidence is Key: Verbalized Uncertainty Evaluation in Large Language and Vision-Language Models," TrustNLP at ACL 2024 (arXiv:2405.02917) — the most thorough empirical audit of how verbalized confidence breaks down across models and tasks.</li>
</ul>]]></content>
        <author>
            <name>Mike Thrift</name>
        </author>
        <category label="LLM" term="LLM"/>
        <category label="AI" term="AI"/>
        <category label="Machine Learning" term="Machine Learning"/>
        <category label="Trust" term="Trust"/>
        <category label="Finance" term="Finance"/>
        <category label="Data Science" term="Data Science"/>
        <category label="Hallucination Detection" term="Hallucination Detection"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[JSONSchemaBench: Real-World Schema Complexity Breaks LLM Structured Output Guarantees]]></title>
        <id>https://beancount.io/bean-labs/research-logs/2026/07/08/jsonschemabench-structured-outputs-language-models</id>
        <link href="https://beancount.io/bean-labs/research-logs/2026/07/08/jsonschemabench-structured-outputs-language-models"/>
        <updated>2026-07-08T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[JSONSchemaBench tests 9,558 real-world JSON schemas against six constrained decoding frameworks and finds that schema complexity causes coverage to collapse from 86% on simple schemas to 3% on complex ones, with XGrammar silently emitting 38 non-compliant outputs and no framework covering all 45 JSON Schema feature categories.]]></summary>
        <content type="html"><![CDATA[<p>Most teams treat constrained decoding as a solved problem — add a JSON schema, get back valid JSON. JSONSchemaBench (arXiv:2501.10868) is the first systematic attempt to test that assumption against 9,558 real-world schemas, and the results are less reassuring than the marketing would suggest.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-paper">The paper<a href="https://beancount.io/bean-labs/research-logs/2026/07/08/jsonschemabench-structured-outputs-language-models#the-paper" class="hash-link" aria-label="Direct link to The paper" title="Direct link to The paper" translate="no">​</a></h2>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-beancount-io?title=JSONSchemaBench%3A%20Real-World%20Schema%20Complexity%20Breaks%20LLM%20Structured%20Output%20Guarantees" alt="2026-07-08-jsonschemabench-structured-outputs-language-models" class="img_ev3q"></p>
<p>Saibo Geng, Hudson Cooper, Michał Moskal, and colleagues at Microsoft Research introduce JSONSchemaBench, a benchmark of 9,558 schemas drawn from real production sources: GlaiveAI function-call signatures, GitHub repositories stratified by complexity from trivial to ultra, Kubernetes API configs, Snowplow event analytics schemas, and the JSONSchemaStore collection. They evaluate six constrained decoding frameworks — Guidance, Outlines, Llamacpp, XGrammar, OpenAI Structured Outputs, and Gemini — across three axes: coverage (what fraction of schemas the framework can handle at all), efficiency (tokens-per-second overhead versus unconstrained generation), and quality (downstream task accuracy). The evaluation grid also includes the official JSON Schema Test Suite, which documents 45 feature categories that any compliant engine should support.</p>
<p>The core claim is that schema complexity is the decisive variable that separates capable frameworks from fragile ones, and that no single framework dominates across all three axes.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-ideas">Key ideas<a href="https://beancount.io/bean-labs/research-logs/2026/07/08/jsonschemabench-structured-outputs-language-models#key-ideas" class="hash-link" aria-label="Direct link to Key ideas" title="Direct link to Key ideas" translate="no">​</a></h2>
<ul>
<li class=""><strong>Coverage collapses under schema complexity.</strong> On simple GlaiveAI schemas all frameworks score above 86%. But on GitHub-Hard schemas — multi-level nesting, recursive definitions, complex pattern constraints — Guidance drops to 41%, Llamacpp to 39%, XGrammar to 28%, and Outlines to a catastrophic 3%. OpenAI reaches only 9% on GitHub-Hard, and Gemini produces no valid outputs at all on medium-complexity or harder schemas.</li>
<li class=""><strong>Kubernetes exposes a specific weakness in XGrammar.</strong> Despite XGrammar's speed claims, it achieves only 7% coverage on Kubernetes schemas, likely because those schemas rely on context-dependent patterns that XGrammar's context-independent precomputation cannot handle. Testing coverage against schemas as complex as Kubernetes configs is not optional for production agents.</li>
<li class=""><strong>Under-constrained is more dangerous than compilation failure.</strong> XGrammar exhibits 38 under-constrained failures against the JSON Schema Test Suite — meaning it emits JSON that violates the declared schema while silently reporting success. Guidance has only 1 such failure. For a write-back agent, a compilation error is caught at design time; an under-constrained failure corrupts data at runtime without any signal.</li>
<li class=""><strong>Guidance's fast-forwarding delivers a genuine 50% speedup.</strong> When long deterministic sequences are present (e.g., field names in a fixed object structure), Guidance can advance multiple tokens per decoding step. On Llama-3.1-8B on an A100, Guidance runs at 6–9 ms per output token while unconstrained generation runs at 15–16 ms. Outlines is slower than unconstrained generation at 30–46 ms, largely due to its up-front automaton compilation taking 3–8 seconds per schema.</li>
<li class=""><strong>Constrained decoding modestly improves reasoning accuracy.</strong> On GSM8K (math), Guidance lifts accuracy from 80.1% (unconstrained) to 83.8%. On Last Letter and Shuffle Objects, gains are in the 1–3 point range. This contradicts the widely cited concern that forcing JSON format degrades answer quality — but the effect size is small enough that format choice should not drive framework selection.</li>
<li class=""><strong>No framework covers all 45 JSON Schema feature categories.</strong> Guidance covers 13, Llamacpp and XGrammar each cover 1, and Outlines covers 0. The practical implication is that any schema using <code>if/then/else</code>, <code>unevaluatedProperties</code>, or recursive <code>$ref</code> definitions will behave unpredictably depending on which engine is under the hood.</li>
</ul>
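<p>The under-constrained failure mode suggests a cheap defense: independently re-validate the decoder's output against the declared schema before accepting it, rather than trusting the framework's own success signal. The sketch below hand-rolls checks for a tiny subset of JSON Schema (required keys and primitive types) purely for illustration; a production system should use a full validator such as the <code>jsonschema</code> package:</p>

```python
import json

# Tiny illustrative subset of JSON Schema: 'type' on primitives, plus
# 'required' and 'properties' on objects. A real system should use a
# complete validator (e.g. the jsonschema package); this only shows
# the independent re-validation step.
PRIMITIVES = {"string": str, "number": (int, float), "integer": int,
              "boolean": bool}

def conforms(value, schema):
    t = schema.get("type")
    if t in PRIMITIVES:
        return isinstance(value, PRIMITIVES[t])
    if t == "object":
        if not isinstance(value, dict):
            return False
        if any(k not in value for k in schema.get("required", [])):
            return False
        props = schema.get("properties", {})
        return all(conforms(value[k], s)
                   for k, s in props.items() if k in value)
    return True  # unsupported keyword: accept rather than silently reject

schema = {"type": "object",
          "required": ["narration", "amount"],
          "properties": {"narration": {"type": "string"},
                         "amount": {"type": "number"}}}

good = json.loads('{"narration": "Whole Foods", "amount": 42.0}')
bad = json.loads('{"narration": "Whole Foods"}')  # missing 'amount'
assert conforms(good, schema)
assert not conforms(bad, schema)
```

<p>The point is the architecture, not the validator: the check runs outside the constrained-decoding engine, so an under-constrained emission fails loudly at generation time instead of corrupting the ledger at runtime.</p>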
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-holds-up--and-what-doesnt">What holds up — and what doesn't<a href="https://beancount.io/bean-labs/research-logs/2026/07/08/jsonschemabench-structured-outputs-language-models#what-holds-up--and-what-doesnt" class="hash-link" aria-label="Direct link to What holds up — and what doesn't" title="Direct link to What holds up — and what doesn't" translate="no">​</a></h2>
<p>The benchmark's strongest contribution is schema sourcing. Prior evaluations used toy schemas or single-source collections. Including Kubernetes configs alongside function-call signatures is the right kind of adversarial diversity. The complexity stratification (trivial through ultra) also gives practitioners a calibration curve: if your schemas look like GlaiveAI function calls, XGrammar or Guidance are both fine; if they look like Kubernetes manifests, your options narrow fast.</p>
<p>The main weakness is the single-sample greedy evaluation. Measuring coverage with one generation per schema understates true capability — a framework might fail 20% of the time but succeed on retry. The paper acknowledges this but doesn't report temperature-sampled pass@k numbers, which would matter for production systems that retry on failure.</p>
<p>The comparison also mixes incomparable models. Open-source frameworks (Guidance, Outlines, Llamacpp, XGrammar) are tested on Llama-3.2-1B, while OpenAI and Gemini run their own undisclosed models. OpenAI's 9% coverage on GitHub-Hard may reflect model capability as much as constrained decoding architecture. A fair comparison would need controlled model access — which the authors obviously cannot obtain from proprietary providers.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-this-matters-for-finance-ai">Why this matters for finance AI<a href="https://beancount.io/bean-labs/research-logs/2026/07/08/jsonschemabench-structured-outputs-language-models#why-this-matters-for-finance-ai" class="hash-link" aria-label="Direct link to Why this matters for finance AI" title="Direct link to Why this matters for finance AI" translate="no">​</a></h2>
<p>Every Beancount write-back agent generates structured output. If the agent emits Beancount directives as JSON before converting to <code>.beancount</code> syntax, or if it calls tools via JSON schemas, the reliability of that JSON generation is not a detail — it is the whole game. The FinTrace paper showed that frontier models fail at reasoning over tool outputs; JSONSchemaBench reveals an orthogonal problem: even before reasoning, the formatting layer may silently emit non-compliant output.</p>
<p>The Kubernetes result is particularly telling for Beancount. Ledger schemas are not flat key-value bags. Account hierarchies, transaction metadata, and tag structures create nested recursive patterns similar to Kubernetes API objects. A framework that scores 7% on Kubernetes is not ready for complex ledger schemas, regardless of how fast its per-token overhead is.</p>
<p>The under-constrained failure mode is the one I would lose sleep over. A Beancount agent using XGrammar could emit a transaction that passes the framework's internal validation check but violates the actual schema — and the agent would have no reason to retry. Silent corruption is worse than visible failure.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-to-read-next">What to read next<a href="https://beancount.io/bean-labs/research-logs/2026/07/08/jsonschemabench-structured-outputs-language-models#what-to-read-next" class="hash-link" aria-label="Direct link to What to read next" title="Direct link to What to read next" translate="no">​</a></h2>
<ul>
<li class=""><strong>XGrammar</strong> (arXiv:2411.15100, Dong et al.) — the technical paper behind one of the fastest frameworks tested, explaining the context-independent/dependent token split and why Kubernetes schemas stress it.</li>
<li class=""><strong>Grammar-Aligned Decoding / ASAp</strong> (NeurIPS 2024) — shows that token masking in constrained decoding can distort the model's probability distribution and proposes a corrected sampling algorithm; the theoretical foundation for quality concerns the benchmark measures only indirectly.</li>
<li class=""><strong>XGrammar-2</strong> (arXiv:2601.04426) — a follow-up that extends XGrammar to dynamic schemas in agentic settings where the schema itself changes during a multi-turn session, directly relevant to Beancount agents that adapt their output format based on which account types are active.</li>
</ul>]]></content>
        <author>
            <name>Mike Thrift</name>
        </author>
        <category label="LLM" term="LLM"/>
        <category label="AI" term="AI"/>
        <category label="Machine Learning" term="Machine Learning"/>
        <category label="Automation" term="Automation"/>
        <category label="Beancount" term="Beancount"/>
        <category label="Performance" term="Performance"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under MCP]]></title>
        <id>https://beancount.io/bean-labs/research-logs/2026/07/07/finmcp-bench-llm-agents-financial-tool-use-model-context-protocol</id>
        <link href="https://beancount.io/bean-labs/research-logs/2026/07/07/finmcp-bench-llm-agents-financial-tool-use-model-context-protocol"/>
        <updated>2026-07-07T00:00:00.000Z</updated>
<summary type="html"><![CDATA[FinMCP-Bench evaluates six LLMs on 613 real-world financial tool-use tasks backed by 65 MCP servers — the best model scores 3.08% exact match on multi-turn tasks, revealing a 20× performance collapse from single-tool to multi-turn scenarios.]]></summary>
        <content type="html"><![CDATA[<p>MCP has become the de facto wiring standard for LLM tool use — Anthropic introduced it in late 2024, and by early 2026 all major model providers had adopted it. FinMCP-Bench (arXiv:2603.24943, ICASSP 2026) is the first benchmark built on real MCP tool servers specifically for financial agents, and it arrived at just the right moment to tell us whether that standardized plumbing actually helps agents do useful financial work.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-paper">The paper<a href="https://beancount.io/bean-labs/research-logs/2026/07/07/finmcp-bench-llm-agents-financial-tool-use-model-context-protocol#the-paper" class="hash-link" aria-label="Direct link to The paper" title="Direct link to The paper" translate="no">​</a></h2>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-beancount-io?title=FinMCP-Bench%3A%20Benchmarking%20LLM%20Agents%20for%20Real-World%20Financial%20Tool%20Use%20under%20MCP" alt="2026-07-07-finmcp-bench-llm-agents-financial-tool-use-model-context-protocol" class="img_ev3q"></p>
<p>Jie Zhu, Yimin Tian, and colleagues from the Alibaba Cloud Qwen DianJin team, YINGMI Wealth Management, and Soochow University present FinMCP-Bench, a 613-sample evaluation suite covering 10 financial scenario categories and 33 sub-scenarios. The tools are not mocked — 65 real MCP-compliant financial tool servers back the benchmark, drawn from actual production logs of the Qieman APP financial assistant. The authors categorize samples into three types: 145 single-tool, 249 multi-tool, and 219 multi-turn. They test six models: the Qwen3 family at 4B, 30B, and 235B parameter counts (all with extended thinking), plus DeepSeek-R1, GPT-OSS-20B, and Seed-OSS-36B. The core evaluation metrics are Tool Precision, Tool Recall, Tool F1, and an Exact Match Rate (EMR) that requires every tool call in a sequence to be exactly right.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-ideas">Key ideas<a href="https://beancount.io/bean-labs/research-logs/2026/07/07/finmcp-bench-llm-agents-financial-tool-use-model-context-protocol#key-ideas" class="hash-link" aria-label="Direct link to Key ideas" title="Direct link to Key ideas" translate="no">​</a></h2>
<ul>
<li class=""><strong>MCP as the evaluation substrate</strong>: using real MCP server definitions rather than synthetic API schemas closes a major gap between benchmark evaluation and what agents actually face in deployed financial systems.</li>
<li class=""><strong>Three-way difficulty split</strong>: single-tool, multi-tool, and multi-turn samples are not just quantity differences — they expose qualitatively different failure modes.</li>
<li class=""><strong>Multi-turn collapse</strong>: the best model (Qwen3-235B) achieves 60% EMR on single-tool, 10.62% EMR on multi-tool, and 3.08% EMR on multi-turn. The drop from single to multi-turn is 20×.</li>
<li class=""><strong>Tool F1 is more forgiving</strong>: the same model scores 66.85%, 69.42%, and 41.56% TF1 across the three settings — showing that models often get the right tools but miss on ordering, parameterization, or conversation tracking.</li>
<li class=""><strong>Recall beats precision in single-tool</strong>: models tend to over-call tools when uncertain rather than under-call, which is the safer failure mode for financial tasks but still means wasted API calls and noise in the reasoning trace.</li>
<li class=""><strong>Non-monotonic size scaling</strong>: Qwen3-30B does not consistently outperform Qwen3-4B across all sub-scenarios, breaking the assumption that larger always wins for multi-step tool use.</li>
</ul>
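<p>To make the metric definitions concrete, here is one plausible reading of Tool Precision/Recall/F1 and EMR over a single trajectory; the paper's exact matching rules (parameter values, ordering within TF1) may differ from this sketch:</p>

```python
from collections import Counter

def tool_prf1(pred, gold):
    """Multiset precision/recall/F1 over tool names in one trajectory.

    A sketch of how metrics like FinMCP-Bench's TF1 could be computed;
    the paper's exact matching rules are my assumption here.
    """
    p, g = Counter(pred), Counter(gold)
    overlap = sum((p & g).values())  # multiset intersection
    precision = overlap / sum(p.values()) if pred else 0.0
    recall = overlap / sum(g.values()) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def exact_match(pred, gold):
    """EMR credits a trajectory only if every call matches, in order."""
    return float(pred == gold)

pred = ["get_price", "get_price", "compute_return"]
gold = ["get_price", "compute_return"]
p, r, f1 = tool_prf1(pred, gold)
assert r == 1.0                 # every gold tool was called...
assert round(p, 2) == 0.67      # ...but one call was redundant
assert exact_match(pred, gold) == 0.0  # EMR is unforgiving
```

<p>The toy trajectory shows exactly why TF1 and EMR diverge so sharply in the results: one redundant over-call barely dents F1 but zeroes the exact-match credit for the whole sequence.</p>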
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-holds-up--and-what-doesnt">What holds up — and what doesn't<a href="https://beancount.io/bean-labs/research-logs/2026/07/07/finmcp-bench-llm-agents-financial-tool-use-model-context-protocol#what-holds-up--and-what-doesnt" class="hash-link" aria-label="Direct link to What holds up — and what doesn't" title="Direct link to What holds up — and what doesn't" translate="no">​</a></h2>
<p>The use of real production logs as the source for single-tool examples is the strongest methodological choice here. It grounds the benchmark in actual user behavior rather than researcher-invented scenarios, which is rare in the finance AI literature. The multi-tool and multi-turn samples are synthetically extended using dependency graphs and role-playing prompts, which is reasonable given the labeling cost, but it introduces a risk: the synthesis process tends to produce cleaner, more telegraphed queries than real users write. The 3.08% EMR on multi-turn is alarming but should be interpreted carefully — EMR requires the complete sequence to be exactly right, so a single wrong intermediate tool call fails the whole task. That's a strict and arguably unrealistic production standard; partial-credit metrics like TF1 tell a more nuanced story.</p>
<p>What the paper doesn't address: there is no analysis of whether the performance gap is primarily an input understanding problem (the model misinterprets what the user wants), an output formatting problem (correct intent but malformed tool call), or a reasoning problem (wrong intermediate conclusions). Without that decomposition, it's hard to know where to invest engineering effort. The paper also evaluates models in isolation; there is no test of whether adding a verification or reflection step changes the multi-turn picture.</p>
<p>The benchmark is also deeply tied to Qieman's specific 65 tools, which limits how well results transfer to other financial platforms with different tool inventories.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-this-matters-for-finance-ai">Why this matters for finance AI<a href="https://beancount.io/bean-labs/research-logs/2026/07/07/finmcp-bench-llm-agents-financial-tool-use-model-context-protocol#why-this-matters-for-finance-ai" class="hash-link" aria-label="Direct link to Why this matters for finance AI" title="Direct link to Why this matters for finance AI" translate="no">​</a></h2>
<p>FinMCP-Bench is the closest published evaluation to what a Beancount write-back agent would actually do: receive a user request, identify which tool (or chain of tools) applies, invoke them in order, and handle follow-up turns. The multi-turn EMR of 3.08% is a cold reality check. A Beancount agent that manages a multi-step ledger correction — say, reclassifying a set of transactions across accounts over a date range, then reconciling, then generating a report — is exactly the kind of multi-turn, multi-tool task that current models fail almost universally by exact-match standards.</p>
<p>The MCP framing is directly relevant: Beancount's Python API, beanquery interface, and fava's REST layer could all be wrapped as MCP servers. FinMCP-Bench tells us that the protocol is not the bottleneck — reasoning over tool call sequences is.</p>
<p>The finding that tool recall exceeds precision (models over-call) also matters for write-back safety: an agent that calls the ledger mutation tool when only a read was needed could corrupt the ledger silently. Precision-biased evaluation metrics, not recall-biased ones, should be the primary safety signal for write-back agents.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-to-read-next">What to read next<a href="https://beancount.io/bean-labs/research-logs/2026/07/07/finmcp-bench-llm-agents-financial-tool-use-model-context-protocol#what-to-read-next" class="hash-link" aria-label="Direct link to What to read next" title="Direct link to What to read next" translate="no">​</a></h2>
<ul>
<li class=""><strong>JSONSchemaBench</strong> (arXiv:2501.10868) — evaluates structured output reliability across 10K JSON schemas; directly addresses whether the tool call formatting failures in FinMCP-Bench are a constrained decoding problem.</li>
<li class=""><strong>ToolLLM</strong> (arXiv:2307.16789, ICLR 2024) — the foundational tool-use training framework against which FinMCP-Bench positions itself; understanding its depth-first search tree exploration clarifies what FinMCP-Bench's production-log methodology adds.</li>
<li class=""><strong>WildToolBench</strong> (arXiv:2604.06185) — evaluates tool use on real user queries in the wild; its finding that no model exceeds 15% accuracy on wild user behavior complements FinMCP-Bench's production-log approach.</li>
</ul>]]></content>
        <author>
            <name>Mike Thrift</name>
        </author>
        <category label="AI" term="AI"/>
        <category label="LLM" term="LLM"/>
        <category label="Automation" term="Automation"/>
        <category label="Beancount" term="Beancount"/>
        <category label="Fintech" term="Fintech"/>
        <category label="Machine Learning" term="Machine Learning"/>
        <category label="Reconciliation" term="Reconciliation"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[FinTrace: Trajectory-Level Evaluation of LLM Tool Calling for Financial Tasks]]></title>
        <id>https://beancount.io/bean-labs/research-logs/2026/07/06/fintrace-trajectory-level-evaluation-llm-tool-calling-financial-tasks</id>
        <link href="https://beancount.io/bean-labs/research-logs/2026/07/06/fintrace-trajectory-level-evaluation-llm-tool-calling-financial-tasks"/>
        <updated>2026-07-06T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[FinTrace benchmarks 13 LLMs on 800 expert-annotated financial task trajectories across 9 metrics, finding that frontier models achieve strong tool selection (F1 ~0.9) but score only 3.23/5 on information utilization — the step where agents reason over what tools return.]]></summary>
        <content type="html"><![CDATA[<p>FinTrace (arXiv:2604.10015) arrives one week after FinToolBench, which I logged last time, and the two papers are in direct conversation with each other. Where FinToolBench measures whether an agent calls the right tools, FinTrace asks the harder question: even when an agent calls the right tools, does it actually reason over the results? That distinction is the crux of the paper and, I think, the crux of the entire Beancount write-back agent problem.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-paper">The paper<a href="https://beancount.io/bean-labs/research-logs/2026/07/06/fintrace-trajectory-level-evaluation-llm-tool-calling-financial-tasks#the-paper" class="hash-link" aria-label="Direct link to The paper" title="Direct link to The paper" translate="no">​</a></h2>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-beancount-io?title=FinTrace%3A%20Trajectory-Level%20Evaluation%20of%20LLM%20Tool%20Calling%20for%20Financial%20Tasks" alt="2026-07-06-fintrace-trajectory-level-evaluation-llm-tool-calling-financial-tasks" class="img_ev3q"></p>
<p>Cao et al. introduce FinTrace, a benchmark of 800 expert-annotated trajectories spanning 34 real-world financial task categories across easy, medium, and hard difficulty tiers. The authors construct their evaluation around a rubric of nine metrics organized along four axes: <strong>action correctness</strong> (tool-calling F1, task relevance), <strong>execution efficiency</strong> (step efficiency, redundancy score), <strong>process quality</strong> (logical progression, information utilization, progress score), and <strong>output quality</strong> (task pass rate, final answer quality). They evaluate 13 LLMs and also release FinTrace-Training, a dataset of 8,196 curated preference trajectories for fine-tuning.</p>
<p>The central claim is that frontier models have mastered tool selection but systematically fail at the harder step: using what the tools return. The benchmark probes this with a 5-point scale for information utilization, logical progression, and progress score, plus algorithmic metrics for tool F1 and step efficiency.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-ideas">Key ideas<a href="https://beancount.io/bean-labs/research-logs/2026/07/06/fintrace-trajectory-level-evaluation-llm-tool-calling-financial-tasks#key-ideas" class="hash-link" aria-label="Direct link to Key ideas" title="Direct link to Key ideas" translate="no">​</a></h2>
<ul>
<li class="">The best-performing model, Claude-Opus-4.6, achieves a Tool-Calling F1 of 0.896 — strong selection — but scores only 3.23/5 on Information Utilization, the weakest of the four output-facing metrics.</li>
<li class="">Claude-Opus-4.6's Task Pass Rate is 2.65/5, and Final Answer Quality is 3.34/5; even the top model does not consistently produce correct, complete answers.</li>
<li class="">Qwen-3.5-9B exhibits a degenerate pattern: near-perfect Step Efficiency (1.000) and Redundancy (1.000) because it barely calls any tools, reflected in a Tool-Calling F1 of 0.109. Efficient but useless.</li>
<li class="">Training on FinTrace-Training improves intermediate process metrics (Logical Progression rises from 2.29 to 2.56 with DPO; Progress Score from 2.00 to 2.30), but Final Answer Quality stays bottlenecked — no variant significantly exceeds 1.21 average on the 1–5 scale for small models.</li>
<li class="">DPO outperforms SFT at suppressing catastrophic failure modes: the share of Logical Progression scores of 1 drops from 11.9% (SFT) to 9.5% (DPO).</li>
<li class="">The universally worst sub-category across all 13 models is Reasoning QA, where Claude-Opus-4.6 achieves only 0.62 overall — a hard ceiling shared even by the strongest frontier model.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-holds-up--and-what-doesnt">What holds up — and what doesn't<a href="https://beancount.io/bean-labs/research-logs/2026/07/06/fintrace-trajectory-level-evaluation-llm-tool-calling-financial-tasks#what-holds-up--and-what-doesnt" class="hash-link" aria-label="Direct link to What holds up — and what doesn't" title="Direct link to What holds up — and what doesn't" translate="no">​</a></h2>
<p>The core finding — that tool selection and tool reasoning are dissociable — is well motivated and the four-axis rubric is a genuine contribution. Prior benchmarks like FinToolBench stop at execution traces; FinTrace adds LLM-judged process quality metrics that expose what happens in between. The inter-rater Cohen's κ of 0.89 on 100-sample validation is encouraging for a benchmark built partly on LLM judges.</p>
<p>That said, several methodological choices limit what I can take from the numbers at face value. The 34 task categories are not enumerated in the main paper — they're deferred to Appendix B — so I can't tell how representative they are of real-world financial practice. The difficulty tiers are defined by percentile ranks within the benchmark's own query pool, which is a circular measure: "hard" just means unusual relative to the other 800 trajectories, not hard in any absolute sense.</p>
<p>The fine-tuning analysis is frustrating. Training a 9B model on FinTrace-Training improves intermediate reasoning but final answer quality stays broken. The paper attributes this to a "disconnect" between process and output, but doesn't explain why. The most plausible explanation — that a 9B model lacks the factual recall and arithmetic capability needed for finance tasks regardless of trajectory quality — is left unaddressed. Showing DPO results only for Qwen-3.5-9B also makes it impossible to know whether larger models benefit more.</p>
<p>I'm also skeptical of the overall score aggregation. Combining algorithmic metrics (F1 ∈ [0,1]) with LLM-judged scores on 1–5 Likert scales by normalizing to [0,1] and averaging conflates very different failure types. A model that calls the wrong tools entirely is not the same kind of broken as a model that calls the right tools and then ignores the output.</p>
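<p>The conflation is easy to see with toy numbers. A minimal sketch, with the metric names and the equal-weight averaging as my own illustration rather than the paper's exact aggregation formula:</p>

```python
def overall_score(tool_f1: float, info_util: float, final_quality: float) -> float:
    """Naive aggregation: normalize each metric to [0, 1] and average."""
    likert_to_unit = lambda s: (s - 1) / 4  # map a 1-5 Likert score onto [0, 1]
    parts = [tool_f1, likert_to_unit(info_util), likert_to_unit(final_quality)]
    return sum(parts) / len(parts)

# Model A: calls the wrong tools almost every time, "reasons" plausibly over noise.
a = overall_score(tool_f1=0.2, info_util=4.0, final_quality=3.0)
# Model B: calls the right tools, then largely ignores what they return.
b = overall_score(tool_f1=0.9, info_util=2.0, final_quality=2.2)

print(round(a, 3), round(b, 3))  # → 0.483 0.483: identical scores, very different failures
```

Two qualitatively different breakages collapse onto the same headline number, which is exactly the objection above.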
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-this-matters-for-finance-ai">Why this matters for finance AI<a href="https://beancount.io/bean-labs/research-logs/2026/07/06/fintrace-trajectory-level-evaluation-llm-tool-calling-financial-tasks#why-this-matters-for-finance-ai" class="hash-link" aria-label="Direct link to Why this matters for finance AI" title="Direct link to Why this matters for finance AI" translate="no">​</a></h2>
<p>The core finding maps directly onto the Beancount write-back problem. An agent that reliably calls the right Beancount CLI tools but then misinterprets the output — say, parsing a balance sheet response and posting to the wrong account — is worse than no automation: it produces confidently wrong ledger entries that look correct to a casual reviewer.</p>
<p>The Information Utilization metric is the one I'd watch most carefully for any Beancount agent. The fact that the best available model scores 3.23/5 on this in a controlled financial benchmark should be a forcing constraint on any production deployment. It argues for mandatory human review of any write-back operation, at least until we see that score consistently above 4.0.</p>
<p>FinTrace also confirms what ReDAct suggested last week: the right architecture is not end-to-end LLM reasoning but a pipeline that externalizes verification. An agent that selects tools well (Tool F1 ~0.9) and then passes results to a separate validation step before acting is more defensible than one that tries to reason over raw tool output in a single pass.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-to-read-next">What to read next<a href="https://beancount.io/bean-labs/research-logs/2026/07/06/fintrace-trajectory-level-evaluation-llm-tool-calling-financial-tasks#what-to-read-next" class="hash-link" aria-label="Direct link to What to read next" title="Direct link to What to read next" translate="no">​</a></h2>
<ul>
<li class="">FinMCP-Bench (arXiv:2603.24943): the companion paper using MCP as the tool interface standard, next on the reading list — directly comparable to FinTrace but built on a different protocol layer</li>
<li class="">"Benchmarking LLM Tool-Use in the Wild" (arXiv:2604.06185): appeared simultaneously and evaluates tool calling outside finance; would clarify whether the information-utilization gap is domain-specific or general</li>
<li class="">"Data-Driven Function Calling Improvements in Large Language Model for Online Financial QA" (arXiv:2604.05387): targets the same tool-calling failure modes from a training-data perspective and may explain what FinTrace-Training's DPO is missing</li>
</ul>]]></content>
        <author>
            <name>Mike Thrift</name>
        </author>
        <category label="LLM" term="LLM"/>
        <category label="AI" term="AI"/>
        <category label="Finance" term="Finance"/>
        <category label="Fintech" term="Fintech"/>
        <category label="Automation" term="Automation"/>
        <category label="Beancount" term="Beancount"/>
        <category label="Machine Learning" term="Machine Learning"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[FinToolBench: Evaluating LLM Agents on Real-World Financial Tool Use]]></title>
        <id>https://beancount.io/bean-labs/research-logs/2026/07/05/fintoolbench-evaluating-llm-agents-real-world-financial-tool-use</id>
        <link href="https://beancount.io/bean-labs/research-logs/2026/07/05/fintoolbench-evaluating-llm-agents-real-world-financial-tool-use"/>
        <updated>2026-07-05T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[FinToolBench pairs 760 live financial API tools with 295 executable queries to test LLM agents on real financial tasks, finding that GPT-4o's conservative 22.7% invocation rate yields higher answer quality (CSS 0.670) than Qwen3-8B's aggressive 87.1% TIR, while intent mismatch exceeds 50% across all tested models.]]></summary>
        <content type="html"><![CDATA[<p>Most financial AI benchmarks test whether a model can read a document. FinToolBench tests whether a model can <em>do</em> something: call a live API, pull current market data, and return the correct answer. That gap is the one that matters for any system trying to automate real financial work, and it is the gap I have been waiting to see closed rigorously.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="стаття">The paper<a href="https://beancount.io/bean-labs/research-logs/2026/07/05/fintoolbench-evaluating-llm-agents-real-world-financial-tool-use#%D1%81%D1%82%D0%B0%D1%82%D1%82%D1%8F" class="hash-link" aria-label="Direct link to The paper" title="Direct link to The paper" translate="no">​</a></h2>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-beancount-io?title=FinToolBench%3A%20%D0%9E%D1%86%D1%96%D0%BD%D0%BA%D0%B0%20%D0%B0%D0%B3%D0%B5%D0%BD%D1%82%D1%96%D0%B2%20LLM%20%D0%BD%D0%B0%20%D0%BE%D1%81%D0%BD%D0%BE%D0%B2%D1%96%20%D0%B2%D0%B8%D0%BA%D0%BE%D1%80%D0%B8%D1%81%D1%82%D0%B0%D0%BD%D0%BD%D1%8F%20%D1%84%D1%96%D0%BD%D0%B0%D0%BD%D1%81%D0%BE%D0%B2%D0%B8%D1%85%20%D1%96%D0%BD%D1%81%D1%82%D1%80%D1%83%D0%BC%D0%B5%D0%BD%D1%82%D1%96%D0%B2%20%D1%83%20%D1%80%D0%B5%D0%B0%D0%BB%D1%8C%D0%BD%D0%B8%D1%85%20%D1%83%D0%BC%D0%BE%D0%B2%D0%B0%D1%85" alt="2026-07-05-fintoolbench-evaluating-llm-agents-real-world-financial-tool-use" class="img_ev3q"></p>
<p>Jiaxuan Lu and colleagues present FinToolBench (arXiv:2603.08262, March 2026) as what they call the first real-world executable benchmark for evaluating financial tool-learning agents. The framing is direct: existing financial AI evaluations center on static document QA, while general tool-use benchmarks like ToolLLM treat finance as just another API category without its domain-specific compliance constraints. FinToolBench aims to fill the space between those two failure modes.</p>
<p>The benchmark pairs 760 executable financial tools (261 live endpoints from RapidAPI and 499 interfaces from AkShare) with 295 curated evaluation queries, split into 166 single-tool and 129 multi-tool cases. The tools span equities, bonds, funds, forex, derivatives, macroeconomics, and cryptocurrency. Crucially, these are real, callable APIs rather than mocked stubs. The authors also introduce FATR (Finance-Aware Tool Routing), a baseline agent that uses BGE-M3 retrieval (top-20 candidates), tool cards annotated with financial attributes, and a constraint-aware ReAct planner capped at five steps.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="ключові-ідеї">Key ideas<a href="https://beancount.io/bean-labs/research-logs/2026/07/05/fintoolbench-evaluating-llm-agents-real-world-financial-tool-use#%D0%BA%D0%BB%D1%8E%D1%87%D0%BE%D0%B2%D1%96-%D1%96%D0%B4%D0%B5%D1%97" class="hash-link" aria-label="Direct link to Key ideas" title="Direct link to Key ideas" translate="no">​</a></h2>
<ul>
<li class=""><strong>Execution is not the bottleneck; reasoning over results is.</strong> GPT-4o has the highest conditional soft score (CSS = 0.670), meaning it produces correct answers when it successfully calls a tool, yet it invokes tools only 22.7% of the time (TIR = 0.227). Qwen3-8B invokes tools 87.1% of the time but gets the right answer on only 40.4% of its successful calls.</li>
<li class=""><strong>Intent mismatch is the dominant compliance failure.</strong> The Intent Mismatch Rate (IMR) exceeds 50% for most models, meaning agents routinely make calls with transactional intent when the query requires only an informational lookup. In regulated financial contexts, that is a serious problem.</li>
<li class=""><strong>Injecting financial attributes helps compliance without hurting capability.</strong> FATR's baseline tool cards, in which each tool is annotated with timeliness, intent type, and regulatory domain, reduce stale-data calls (TMR) and domain violations (DMR) without significantly degrading invocation rates.</li>
<li class=""><strong>Multi-tool queries expose a reliability gap.</strong> The 129 multi-tool queries require chaining calls and passing results between steps; performance drops substantially relative to single-tool cases, consistent with the findings of FinTrace and TheAgentCompany.</li>
<li class=""><strong>Small models call tools more often but do not out-reason large ones.</strong> Qwen3-8B's TIR of 0.871 versus GPT-4o's 0.227 shows that smaller models are more action-prone, but a CER (conditional execution rate, i.e. TESR/TIR) of 0.339 for Qwen3-8B versus 0.618 for GPT-4o indicates that GPT-4o is far more accurate when it does decide to call a tool.</li>
</ul>
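<p>The precision-versus-coverage arithmetic in that last bullet is worth making concrete. A minimal sketch of the rate definitions; the query and call counts here are hypothetical, chosen only to reproduce the reported TIR and CER values:</p>

```python
def rates(n_queries: int, n_invocations: int, n_successful: int):
    """Derive TIR (how often the agent calls a tool at all) and
    CER = TESR/TIR (how often an invocation actually succeeds)."""
    tir = n_invocations / n_queries   # tool invocation rate
    tesr = n_successful / n_queries   # tool execution success rate
    cer = tesr / tir if tir else 0.0  # conditional execution rate
    return tir, tesr, cer

# Reconstructing the two operating points over a nominal 1,000 queries:
gpt4o = rates(1000, 227, 140)  # conservative caller: TIR 0.227, CER ≈ 0.617
qwen = rates(1000, 871, 295)   # aggressive caller:   TIR 0.871, CER ≈ 0.339
```

The same success count spread over four times as many invocations is what drives Qwen3-8B's CER down: coverage without precision.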
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="що-витримує-критику-а-що-ні">What holds up — and what doesn't<a href="https://beancount.io/bean-labs/research-logs/2026/07/05/fintoolbench-evaluating-llm-agents-real-world-financial-tool-use#%D1%89%D0%BE-%D0%B2%D0%B8%D1%82%D1%80%D0%B8%D0%BC%D1%83%D1%94-%D0%BA%D1%80%D0%B8%D1%82%D0%B8%D0%BA%D1%83-%D0%B0-%D1%89%D0%BE-%D0%BD%D1%96" class="hash-link" aria-label="Direct link to What holds up — and what doesn't" title="Direct link to What holds up — and what doesn't" translate="no">​</a></h2>
<p>The benchmark's choice to use genuinely live, executable APIs is its core contribution, and it is a substantial one. Mocked APIs have been the dirty secret of tool-use benchmarks: ToolLLM's 16,000 APIs sound impressive until you realize the evaluation uses an LLM to judge whether a call "would have worked." FinToolBench avoids that.</p>
<p>The compliance metrics (TMR, IMR, DMR) are conceptually right: financial agents must understand the difference between fetching yesterday's closing price and initiating a trade. But the paper's description of how these classifications are implemented is thin. It is unclear whether the ground-truth labels for intent type (informational versus transactional) were validated by legal or compliance experts or simply assigned by the dataset authors. In practice, that matters a great deal.</p>
<p>The model lineup is also oddly narrow: Doubao-Seed-1.6, Qwen3-8B, GLM-4.7-Flash, and GPT-4o. Claude Sonnet and Gemini 2.5 are absent, and both would have been natural comparison points. The results table shows GPT-4o as a high-precision, low-coverage outlier; I would like to know whether Claude's tool-use behavior sits closer to GPT-4o's conservative pattern or Qwen3-8B's aggressive one.</p>
<p>The 295-query evaluation set is small by modern benchmark standards. With 760 tools, a pool of 295 queries means most tools are never exercised. The paper reports no per-domain coverage statistics, so the headline numbers could be driven by a subset of well-covered domains such as equities and macroeconomics.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="чому-це-важливо-для-фінансового-ші">Why this matters for finance AI<a href="https://beancount.io/bean-labs/research-logs/2026/07/05/fintoolbench-evaluating-llm-agents-real-world-financial-tool-use#%D1%87%D0%BE%D0%BC%D1%83-%D1%86%D0%B5-%D0%B2%D0%B0%D0%B6%D0%BB%D0%B8%D0%B2%D0%BE-%D0%B4%D0%BB%D1%8F-%D1%84%D1%96%D0%BD%D0%B0%D0%BD%D1%81%D0%BE%D0%B2%D0%BE%D0%B3%D0%BE-%D1%88%D1%96" class="hash-link" aria-label="Direct link to Why this matters for finance AI" title="Direct link to Why this matters for finance AI" translate="no">​</a></h2>
<p>Beancount write-back agents (any agent that calls <code>bean-add</code>, patches a ledger file, or queries <code>beanquery</code>) face exactly the failure modes FinToolBench exposes. The intent-mismatch problem translates directly: a Beancount agent that issues a write call when the user asked a read question has the same failure signature as an IMR violation. The timeliness dimension maps to the problem of using stale cached ledger state when the user expects the current balance.</p>
<p>The precision-versus-coverage tension (GPT-4o versus Qwen3-8B) is also directly relevant. For Beancount write-back I would take GPT-4o's conservative invocation behavior, low TIR but high CER and CSS, over a high-invocation model that frequently executes the wrong tool. Erroneous entries cost far more than inaction.</p>
<p>FATR's approach of annotating tools with compliance attributes, rather than relying on the model to infer them, is a design pattern worth adopting. Wrapping Beancount's command-line tools in explicit metadata about whether a call is read-only or mutating, and whether it targets current or archived ledger state, is the same idea applied at a smaller scale.</p>
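<p>A sketch of what such tool cards could look like for Beancount CLI wrappers; the attribute names and gating logic are my own illustration of the annotation idea, not FATR's actual schema:</p>

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCard:
    """Static metadata attached to a tool, declared rather than inferred."""
    name: str
    mutates_ledger: bool   # write vs. read-only (the intent dimension)
    uses_live_state: bool  # current ledger vs. cached/archived snapshot

TOOLS = {
    "beanquery":  ToolCard("beanquery",  mutates_ledger=False, uses_live_state=True),
    "bean-check": ToolCard("bean-check", mutates_ledger=False, uses_live_state=True),
    "bean-add":   ToolCard("bean-add",   mutates_ledger=True,  uses_live_state=True),
}

def allowed(tool: str, user_intent: str) -> bool:
    """Gate mutating tools behind explicit write intent, instead of trusting
    the model to infer it (the IMR failure mode)."""
    card = TOOLS[tool]
    if card.mutates_ledger and user_intent != "write":
        return False
    return True

assert allowed("beanquery", "read")
assert not allowed("bean-add", "read")  # a read question must never trigger a write
```

The point of the pattern is that the compliance check runs outside the model: a misclassified intent still cannot reach a mutating tool.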
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="що-почитати-далі">What to read next<a href="https://beancount.io/bean-labs/research-logs/2026/07/05/fintoolbench-evaluating-llm-agents-real-world-financial-tool-use#%D1%89%D0%BE-%D0%BF%D0%BE%D1%87%D0%B8%D1%82%D0%B0%D1%82%D0%B8-%D0%B4%D0%B0%D0%BB%D1%96" class="hash-link" aria-label="Direct link to What to read next" title="Direct link to What to read next" translate="no">​</a></h2>
<ul>
<li class=""><strong>FinTrace</strong> (arXiv:2604.10015) — trajectory-level evaluation across 34 financial task categories with 9 metrics; directly extends FinToolBench's single-call evaluation to multi-step sequences and fine-tunes Qwen-3.5-9B with DPO to improve intermediate reasoning.</li>
<li class=""><strong>FinMCP-Bench</strong> (arXiv:2603.24943) — 613 samples over 65 MCP-based financial tools, testing single-tool calls, multi-tool calls, and multi-turn dialogues; the MCP format bears directly on Beancount tool interfaces.</li>
<li class=""><strong>ToolLLM</strong> (arXiv:2307.16789, ICLR 2024) — the ToolBench paper against which FinToolBench explicitly positions itself; understanding what a mocked-API benchmark can and cannot measure clarifies how much FinToolBench's real-execution capability matters.</li>
</ul>]]></content>
        <author>
            <name>Mike Thrift</name>
        </author>
        <category label="AI" term="AI"/>
        <category label="LLM" term="LLM"/>
        <category label="Automation" term="Automation"/>
        <category label="Machine Learning" term="Machine Learning"/>
        <category label="Fintech" term="Fintech"/>
        <category label="Beancount" term="Beancount"/>
        <category label="Compliance" term="Compliance"/>
        <category label="Data Science" term="Data Science"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[OmniEval: Omnidirectional RAG Evaluation Benchmark for the Financial Domain]]></title>
        <id>https://beancount.io/bean-labs/research-logs/2026/07/04/omnieval-omnidirectional-automatic-rag-evaluation-financial-domain</id>
        <link href="https://beancount.io/bean-labs/research-logs/2026/07/04/omnieval-omnidirectional-automatic-rag-evaluation-financial-domain"/>
        <updated>2026-07-04T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[OmniEval (EMNLP 2025) benchmarks RAG systems across 5 task types × 16 financial topics using 11.4k auto-generated test cases. The best systems achieve only 36% numerical accuracy — concrete evidence that RAG pipelines need validation layers before writing to structured financial ledgers.]]></summary>
        <content type="html"><![CDATA[<p>Most RAG benchmarks in finance ask whether a system can retrieve and answer — full stop. OmniEval (EMNLP 2025, arXiv:2412.13018) from Shuting Wang et al. at RUC asks a harder question: does performance hold across the full matrix of task types and financial topics? I'm reading it now because it's the most structured attempt to map the shape of RAG failure in finance before we try to build reliable Beancount ledger agents on top of RAG pipelines.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-paper">The paper<a href="https://beancount.io/bean-labs/research-logs/2026/07/04/omnieval-omnidirectional-automatic-rag-evaluation-financial-domain#the-paper" class="hash-link" aria-label="Direct link to The paper" title="Direct link to The paper" translate="no">​</a></h2>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-beancount-io?title=OmniEval%3A%20Omnidirectional%20RAG%20Evaluation%20Benchmark%20for%20the%20Financial%20Domain" alt="2026-07-04-omnieval-omnidirectional-automatic-rag-evaluation-financial-domain" class="img_ev3q"></p>
<p>OmniEval constructs a two-dimensional evaluation grid: five task classes (extractive QA, multi-hop reasoning, contrast QA, long-form QA, and conversational QA) crossed with 16 financial topics (stock markets, investment banking, funds, property insurance, and others). The result is a structured benchmark with 11.4k automatically generated test examples, 1.7k human-annotated examples, and a 362k-document retrieval corpus assembled from six Chinese financial data sources (BSCF-DB at 193k documents, FinGLM at 55k, BAAI-Fin at 48k, official web crawls, PDFs, and Wikipedia financial content). The benchmark also includes a fine-tuned LLM evaluator — Qwen2.5-7B-Instruct trained on 910 human-labeled instances — that scores generation quality across accuracy, hallucination, completeness, utilization, and numerical accuracy. The paper was published at EMNLP 2025.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-ideas">Key ideas<a href="https://beancount.io/bean-labs/research-logs/2026/07/04/omnieval-omnidirectional-automatic-rag-evaluation-financial-domain#key-ideas" class="hash-link" aria-label="Direct link to Key ideas" title="Direct link to Key ideas" translate="no">​</a></h2>
<ul>
<li class="">The auto-generated test cases passed a human acceptance check at 87.47%, meaning roughly 1 in 8 generated instances was discarded — not a trivial noise rate for a benchmark.</li>
<li class="">Best retriever (GTE-Qwen2-1.5B) achieved MAP of 0.4370 and MRR of 0.4491 on the auto-generated set, meaning the top-ranked passage is correct less than half the time even with the strongest retriever tested.</li>
<li class="">Generation accuracy (ACC) across all retriever-LLM combinations ranged from 0.3238 to 0.4476 — the best configuration gets fewer than half the questions right.</li>
<li class="">Numerical accuracy (NAC) is the sharpest finding: 0.0659 to 0.3595. The best system gets financial numbers right about 36% of the time; the worst is near-zero.</li>
<li class="">The fine-tuned evaluator reached 74.4% agreement with human annotation (κ = 0.6486), substantially outperforming prompting-only baselines at 55–71% — but still leaving one in four evaluations misaligned with human judgment.</li>
<li class="">Multi-hop reasoning and conversational QA were consistently the hardest task classes.</li>
</ul>
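<p>Since the evaluator is reported with both raw agreement (74.4%) and κ (0.6486), it helps to recall why the two diverge: κ discounts the agreement two raters would reach by chance. A self-contained sketch with illustrative counts, not the paper's data:</p>

```python
def cohens_kappa(table):
    """Cohen's kappa for a square agreement table (rows: rater A, cols: rater B)."""
    n = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(len(table))) / n
    row_marg = [sum(row) / n for row in table]
    col_marg = [sum(table[i][j] for i in range(len(table))) / n
                for j in range(len(table[0]))]
    expected = sum(r * c for r, c in zip(row_marg, col_marg))  # chance agreement
    return (observed - expected) / (1 - expected)

# 100 items, two raters, binary accept/reject: 74% raw agreement
# shrinks considerably once chance agreement is removed.
table = [[60, 14],
         [12, 14]]
kappa = cohens_kappa(table)  # raw agreement 0.74, kappa ≈ 0.34
```

With imbalanced labels, raw agreement flatters the evaluator; κ is the number to watch when comparing it against prompting-only baselines.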
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-holds-up--and-what-doesnt">What holds up — and what doesn't<a href="https://beancount.io/bean-labs/research-logs/2026/07/04/omnieval-omnidirectional-automatic-rag-evaluation-financial-domain#what-holds-up--and-what-doesnt" class="hash-link" aria-label="Direct link to What holds up — and what doesn't" title="Direct link to What holds up — and what doesn't" translate="no">​</a></h2>
<p>The matrix evaluation design is genuinely useful. Previous finance benchmarks (FinanceBench, FinQA, DocFinQA) treat evaluation as a single axis — usually answer accuracy — and miss the structural variation in how RAG fails. Knowing that a system scores well on extractive QA but poorly on multi-hop reasoning is actionable; knowing it averages some overall score is not. The OmniEval grid makes that variation visible, and the finding that performance is inconsistent across topics is exactly the kind of result practitioners need to see before deploying.</p>
<p>That said, there are real limits I want to be direct about. The corpus is overwhelmingly Chinese: five of six data sources are Chinese financial data (BSCF, FinGLM, BAAI-Fin), and the sixth is Chinese Wikipedia. The paper does not report results broken out by language — it just reports aggregate numbers. This makes every score in the paper suspect as a claim about financial RAG in general, as opposed to financial RAG over Chinese text with Chinese-specialized retrievers and LLMs (GTE-Qwen2-1.5B, Qwen2.5-72B, Yi15-34B). English financial users cannot directly use these numbers.</p>
<p>The LLM evaluator is trained on 910 labeled instances. That is thin. The 74.4% human agreement at κ = 0.6486 is defensible as a starting point but means the eval framework itself introduces substantial noise. If the benchmark is used to compare systems that differ by a few percentage points, the evaluator variance will swamp the signal.</p>
<p>The automatic generation pipeline — GPT-4 produces test questions, humans filter at 87.47% acceptance — also raises a contamination question the paper does not address: GPT-4-generated questions may play to GPT-4-class models' strengths in ways that disadvantage older or smaller models systematically.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-this-matters-for-finance-ai">Why this matters for finance AI<a href="https://beancount.io/bean-labs/research-logs/2026/07/04/omnieval-omnidirectional-automatic-rag-evaluation-financial-domain#why-this-matters-for-finance-ai" class="hash-link" aria-label="Direct link to Why this matters for finance AI" title="Direct link to Why this matters for finance AI" translate="no">​</a></h2>
<p>The numerical accuracy scores are the number I keep coming back to: 0.0659–0.3595. If the best tested RAG system gets financial numbers right only 36% of the time in a benchmarked evaluation, any Beancount write-back agent built on top of a naive RAG pipeline is going to corrupt ledger data. Beancount's format is unforgiving — an incorrect amount, date, or account name produces either a parse error or a silent accounting error that can propagate across fiscal years. This benchmark gives us concrete evidence that RAG retrieval and LLM generation are not yet reliable enough for direct ledger write-back without a validation layer.</p>
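<p>What a minimal validation layer might look like, as a sketch: the account whitelist, currency set, and plausibility checks are my own illustration, using plain parsing rather than Beancount's own API.</p>

```python
from decimal import Decimal, InvalidOperation

# Hypothetical account set; in practice this would be read from the ledger itself.
KNOWN_ACCOUNTS = {"Assets:Checking", "Expenses:Groceries", "Income:Salary"}

def validate_posting(account: str, amount: str, currency: str) -> list[str]:
    """Reject a RAG-proposed posting unless every field survives strict checks;
    generation output should never reach the ledger unvalidated."""
    errors = []
    if account not in KNOWN_ACCOUNTS:  # no novel account names from the LLM
        errors.append(f"unknown account: {account}")
    try:
        value = Decimal(amount)        # exact decimal parsing, never float
    except InvalidOperation:
        errors.append(f"unparseable amount: {amount!r}")
    else:
        if value == 0 or abs(value.as_tuple().exponent) > 2:
            errors.append(f"implausible amount: {amount}")
    if currency not in {"USD", "EUR"}:
        errors.append(f"unexpected currency: {currency}")
    return errors

assert validate_posting("Assets:Checking", "42.50", "USD") == []
assert validate_posting("Assets:Chekcing", "4Z.50", "GBP")  # three errors caught
```

None of these checks require a model; they turn a 36%-accurate generator into a component whose mistakes are rejected rather than written.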
<p>The task-class structure also maps cleanly to Beancount use cases. Extractive QA corresponds to simple balance lookups. Multi-hop reasoning corresponds to questions like "what is my net income after tax across Q1–Q3?" Conversational QA corresponds to a user iteratively refining a reconciliation request across a session. OmniEval's finding that multi-hop and conversational tasks are hardest is exactly the bad news for the Beancount agent design: the easy cases are almost fine; the realistic cases are where the system falls apart.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-to-read-next">What to read next<a href="https://beancount.io/bean-labs/research-logs/2026/07/04/omnieval-omnidirectional-automatic-rag-evaluation-financial-domain#what-to-read-next" class="hash-link" aria-label="Direct link to What to read next" title="Direct link to What to read next" translate="no">​</a></h2>
<ul>
<li class="">ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation (arXiv:2311.09476, NAACL 2025) — the closest general-domain analog to OmniEval's evaluator fine-tuning approach; comparing ARES methodology to OmniEval's would clarify whether the LLM-evaluator design choices are principled or ad hoc.</li>
<li class="">RAGEval: Scenario-Specific RAG Evaluation Dataset Generation Framework (ACL 2025, aclanthology.org/2025.acl-long.418) — automated scenario generation for RAG evaluation; extends the auto-generation methodology OmniEval uses and may address the contamination concern.</li>
<li class="">FinRAGBench-V: A Benchmark for Multimodal RAG with Visual Citation in the Financial Domain (arXiv:2505.17471) — extends RAG evaluation to multimodal financial documents (tables, charts); relevant as Beancount users increasingly have receipt images and PDF statements alongside plain-text ledgers.</li>
</ul>]]></content>
        <author>
            <name>Mike Thrift</name>
        </author>
        <category label="AI" term="AI"/>
        <category label="Machine Learning" term="Machine Learning"/>
        <category label="LLM" term="LLM"/>
        <category label="Finance" term="Finance"/>
        <category label="Data Science" term="Data Science"/>
        <category label="Beancount" term="Beancount"/>
        <category label="Automation" term="Automation"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[LLM Anomaly Detection Survey (NAACL 2025): Strong Taxonomy, Absent Tabular Coverage]]></title>
        <id>https://beancount.io/bean-labs/research-logs/2026/07/03/llm-anomaly-ood-detection-survey</id>
        <link href="https://beancount.io/bean-labs/research-logs/2026/07/03/llm-anomaly-ood-detection-survey"/>
        <updated>2026-07-03T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[A critical reading of Xu and Ding's NAACL 2025 survey on LLM-based anomaly and OOD detection: the detection-vs-generation taxonomy holds up, but near-total absence of tabular coverage means financial AI practitioners must synthesize insights from vision models themselves.]]></summary>
<content type="html"><![CDATA[<p>The previous three entries in this thread covered AnoLLM, CausalTAD, and AD-LLM — each targeting tabular anomaly detection specifically. This survey by Ruiyao Xu and Kaize Ding, accepted to NAACL 2025 Findings, is supposed to tie those threads together into a unified landscape map. I expected a taxonomy that would clarify the design space; what I got was mostly a survey of image and video anomaly detection with a thin veneer of generality.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-paper">The paper<a href="https://beancount.io/bean-labs/research-logs/2026/07/03/llm-anomaly-ood-detection-survey#the-paper" class="hash-link" aria-label="Direct link to The paper" title="Direct link to The paper" translate="no">​</a></h2>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-beancount-io?title=LLM%20Anomaly%20Detection%20Survey%20%28NAACL%202025%29%3A%20Strong%20Taxonomy%2C%20Absent%20Tabular%20Coverage" alt="2026-07-03-llm-anomaly-ood-detection-survey" class="img_ev3q"></p>
<p>Xu and Ding's survey (arXiv:2409.01980) proposes organizing LLM-based anomaly and out-of-distribution (OOD) detection into two high-level classes: <strong>LLMs for Detection</strong>, where the model directly identifies anomalies, and <strong>LLMs for Generation</strong>, where the model augments training data or produces natural-language explanations that feed a downstream detector. Each class subdivides further. Detection splits into prompting-based methods (frozen or tuned LLMs queried with natural-language prompts) and contrasting-based methods (CLIP-family models that score anomalousness by comparing image patches to text descriptions). Generation splits into augmentation-centric methods (generating pseudo-OOD labels or synthetic minority samples) and explanation-centric methods (producing natural-language rationales for flagged events).</p>
<p>The accompanying GitHub reading list covers roughly 39 papers: 24 in detection, 10 in augmentation, and 5 in explanation.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-ideas">Key ideas<a href="https://beancount.io/bean-labs/research-logs/2026/07/03/llm-anomaly-ood-detection-survey#key-ideas" class="hash-link" aria-label="Direct link to Key ideas" title="Direct link to Key ideas" translate="no">​</a></h2>
<ul>
<li class=""><strong>Contrasting-based methods dominate image anomaly detection.</strong> WinCLIP achieves 91.8% and 85.1% AUROC on zero-shot anomaly classification and segmentation on MVTec-AD without any dataset-specific tuning, which is competitive with supervised methods trained on that dataset.</li>
<li class=""><strong>Frozen LLMs hit a modality gap for non-text data.</strong> The survey explicitly notes that "directly prompting frozen LLMs for anomaly or OOD detection results across various data types often yields suboptimal performance due to inherent modality gap between text and other data modalities."</li>
<li class=""><strong>LoRA and adapter tuning recover much of that gap.</strong> Methods like AnomalyGPT and AnomalyCLIP fine-tune with parameter-efficient techniques and substantially outperform their frozen counterparts.</li>
<li class=""><strong>Generation as augmentation is underutilized.</strong> BLIP-2-generated caption-level pseudo-OOD labels outperform word-level and description-level alternatives in OOD detection, suggesting that richer text supervision matters even for visual tasks.</li>
<li class=""><strong>Explanation-centric generation is the newest subcategory.</strong> Systems like Holmes-VAD and VAD-LLaMA go beyond binary flags to generate natural-language rationales for anomalous events, mostly in surveillance video.</li>
<li class=""><strong>Tabular data is nearly absent.</strong> The survey cites one method — "Tabular" by Li et al. (2024) — that converts tabular rows into text prompts and fine-tunes with LoRA, but provides no comparative numbers.</li>
</ul>
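<p>The row-to-text conversion behind the survey's lone tabular citation is easy to picture with a small sketch. This is a hypothetical illustration, not code from Li et al.; the field names and prompt wording are invented here.</p>

```python
# Hypothetical sketch: serialize one tabular transaction row into a text
# prompt, in the spirit of the row-to-text approach the survey attributes
# to Li et al. (2024). Field names and wording are illustrative only.
def row_to_prompt(row: dict) -> str:
    """Render a transaction record as a natural-language anomaly query."""
    fields = ", ".join(f"{key} is {value}" for key, value in row.items())
    return f"Transaction record: {fields}. Is this transaction anomalous?"

prompt = row_to_prompt({
    "date": "2026-03-14",
    "payee": "ACME Corp",
    "amount": "-2500.00 USD",
    "account": "Expenses:Office",
})
```

<p>A serialized prompt like this would then go to a LoRA-tuned LLM for a yes/no judgment, which is the path the survey's taxonomy files under prompting-based detection.</p>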
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-holds-up--and-what-doesnt">What holds up — and what doesn't<a href="https://beancount.io/bean-labs/research-logs/2026/07/03/llm-anomaly-ood-detection-survey#what-holds-up--and-what-doesnt" class="hash-link" aria-label="Direct link to What holds up — and what doesn't" title="Direct link to What holds up — and what doesn't" translate="no">​</a></h2>
<p>The two-class taxonomy is genuinely clean and I'll probably use it to organize my own thinking. The detection-vs-generation distinction captures a real architectural fork: you either ask the LLM to classify directly or you use it to build better training signal for a traditional detector.</p>
<p>What I can't accept is the paper's framing as a survey of anomaly detection broadly. The coverage is overwhelmingly concentrated on industrial defect images (MVTec-AD, VisA) and surveillance video (UCF-Crime, XD-Violence). Of the roughly 39 papers catalogued, almost none address tabular or financial data. Time series gets a few citations. Tabular gets one sentence. This is not a landscape map for Bean Labs — it is a landscape map for computer vision researchers who want to use CLIP for defect detection.</p>
<p>The authors acknowledge "space constraints prevent detailed metric summaries," which is a polite way of saying there are no comparison tables. For a survey paper, the absence of quantitative synthesis is a significant gap. Readers cannot use this paper to decide which paradigm is better for their use case without tracking down each cited paper individually.</p>
<p>The hallucination challenge is listed as an open problem, but the treatment is shallow — it names the risk without analyzing which detection paradigms are more or less susceptible, or how explanation-centric generation might make hallucinations more detectable through human review.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-this-matters-for-finance-ai">Why this matters for finance AI<a href="https://beancount.io/bean-labs/research-logs/2026/07/03/llm-anomaly-ood-detection-survey#why-this-matters-for-finance-ai" class="hash-link" aria-label="Direct link to Why this matters for finance AI" title="Direct link to Why this matters for finance AI" translate="no">​</a></h2>
<p>Two subcategories are relevant despite the image-heavy coverage. First, the <strong>explanation-centric generation</strong> subcategory is exactly what Beancount audit agents need: not just a flag that a journal entry is anomalous, but a natural-language sentence explaining why. Financial auditors cannot act on a binary output. Second, the survey's near-total silence on tabular anomaly detection is itself informative — it confirms that the AnoLLM, CausalTAD, and AD-LLM thread I've been following is a frontier area rather than a well-trodden one, and that designing LLM-based audit tools for Beancount ledgers requires synthesizing insights from vision anomaly detection that have not yet been ported to tabular settings.</p>
<p>The prompting-vs-tuning trade-off is the most actionable finding: zero-shot prompting works as a first approximation but suffers from the modality gap; LoRA-based fine-tuning on representative labeled examples closes the gap. For a Beancount deployment with labeled anomaly examples from historical ledgers, the fine-tuning path appears more reliable than pure prompting.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-to-read-next">What to read next<a href="https://beancount.io/bean-labs/research-logs/2026/07/03/llm-anomaly-ood-detection-survey#what-to-read-next" class="hash-link" aria-label="Direct link to What to read next" title="Direct link to What to read next" translate="no">​</a></h2>
<ul>
<li class="">"Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs" (arXiv:2406.03614) — uses LLM sentence-transformer embeddings on real general ledger journal entries; a direct bridge from this survey's framework to the Beancount tabular use case.</li>
<li class="">"Enhancing Anomaly Detection in Financial Markets with an LLM-based Multi-Agent Framework" (arXiv:2403.19735) — multi-agent pipeline for market data anomaly detection; the multi-agent coordination pattern may carry over to ledger audit.</li>
<li class="">AnomalyGPT (arXiv:2308.15366) — fine-tuned LVLM for industrial anomaly detection with pixel-level localization; reading this clarifies what "LLM tuning for detection" actually means architecturally, which the survey describes but does not explain.</li>
</ul>]]></content>
        <author>
            <name>Mike Thrift</name>
        </author>
        <category label="AI" term="AI"/>
        <category label="LLM" term="LLM"/>
        <category label="Machine Learning" term="Machine Learning"/>
        <category label="Fraud Detection" term="Fraud Detection"/>
        <category label="Data Science" term="Data Science"/>
        <category label="Beancount" term="Beancount"/>
        <category label="Analytics" term="Analytics"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Found in the Middle: Calibrating Positional Attention Bias Improves Long-Context RAG]]></title>
        <id>https://beancount.io/bean-labs/research-logs/2026/07/02/found-in-the-middle-calibrating-positional-attention-bias</id>
        <link href="https://beancount.io/bean-labs/research-logs/2026/07/02/found-in-the-middle-calibrating-positional-attention-bias"/>
        <updated>2026-07-02T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[A training-free inference-time calibration subtracts positional bias from LLM attention weights, recovering up to 15 percentage points of RAG accuracy when retrieved documents are buried mid-context — and what it means for finance-specific agent pipelines.]]></summary>
        <content type="html"><![CDATA[<p>I've been thinking about the lost-in-the-middle problem ever since writing the log on Liu et al.'s original finding: pass a long context to an LLM, and it will reliably ignore evidence buried in the middle. "Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization" (Hsieh et al., ACL Findings 2024, arXiv:2406.16008) offers the most direct and practical fix I've seen: a training-free inference-time calibration that subtracts out the model's positional bias from its attention weights, recovering up to 15 percentage points of RAG accuracy.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-paper">The paper<a href="https://beancount.io/bean-labs/research-logs/2026/07/02/found-in-the-middle-calibrating-positional-attention-bias#the-paper" class="hash-link" aria-label="Direct link to The paper" title="Direct link to The paper" translate="no">​</a></h2>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-beancount-io?title=Found%20in%20the%20Middle%3A%20Calibrating%20Positional%20Attention%20Bias%20Improves%20Long-Context%20RAG" alt="2026-07-02-found-in-the-middle-calibrating-positional-attention-bias" class="img_ev3q"></p>
<p>Hsieh et al. start from a diagnostic observation: LLMs — even those trained on long contexts — exhibit a persistent U-shaped attention pattern. Tokens at the beginning and at the end of the input receive disproportionately high attention regardless of whether they are relevant, while tokens in the middle are systematically underweighted. The authors connect this empirically to the lost-in-the-middle accuracy dip rather than treating it as a separate phenomenon.</p>
<p>Their fix is elegant in concept. They decompose attention into two additive components: relevance (what we want) and positional bias (what we don't). To isolate the bias term, they pass a "dummy" document — uninformative filler content — through the same context at each position and record the resulting attention distribution. That dummy-document attention approximates the pure positional prior. Subtracting it from the real attention scores leaves a residual that better reflects true relevance:</p>
<p><strong>Calibrated attention = Attn(document, k) − Attn(dummy, k)</strong></p>
<p>The rescaled scores are then used to re-rank or re-weight retrieved documents before the final answer generation step. Critically, no training is required. The calibration is applied at inference time to the last 16 decoder layers and all attention heads. The cost is O(K) additional forward passes, where K is the number of retrieved documents — non-trivial but predictable.</p>
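<p>The subtraction step can be sketched in a few lines. This is a toy illustration of the idea with made-up attention numbers, not the authors' implementation, which operates on per-layer, per-head attention inside the model.</p>

```python
# Toy sketch of the calibration idea (not the authors' code). Each list holds
# the attention mass assigned to the document occupying position k:
#   attn_dummy[k] - attention a content-free filler document receives there,
#                   approximating the pure positional prior (U-shaped);
#   attn_real[k]  - attention the actual retrieved document receives there.
def calibrate(attn_real, attn_dummy):
    """Subtract the positional prior, leaving a relevance-like residual."""
    return [real - dummy for real, dummy in zip(attn_real, attn_dummy)]

attn_dummy = [0.30, 0.10, 0.05, 0.10, 0.30]  # positional bias only
attn_real  = [0.32, 0.11, 0.20, 0.09, 0.28]  # bias plus true relevance
scores = calibrate(attn_real, attn_dummy)

# Re-rank documents by calibrated score instead of raw attention.
ranking = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
```

<p>In this toy example the middle document (index 2) has the highest residual after calibration, even though raw attention ranked the edge documents above it.</p>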
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-ideas">Key ideas<a href="https://beancount.io/bean-labs/research-logs/2026/07/02/found-in-the-middle-calibrating-positional-attention-bias#key-ideas" class="hash-link" aria-label="Direct link to Key ideas" title="Direct link to Key ideas" translate="no">​</a></h2>
<ul>
<li class="">The U-shaped attention bias is intrinsic to the model architecture and persists even in models explicitly trained with long-context objectives.</li>
<li class="">Passing a dummy (empty/noise) document through the same retrieval context isolates the positional prior; subtracting it removes bias without any finetuning.</li>
<li class="">Recall@3 on NaturalQuestions (K=20, gold document placed in the middle) jumps from 20.52% to 68.32% with calibration; at K=10, from 36.38% to 74.27%.</li>
<li class="">End-to-end QA accuracy improves by 6–15 percentage points when the gold document is mid-context; improvements hold in 22 of 24 experimental configurations.</li>
<li class="">The method outperforms six comparison baselines: vanilla attention, query-generation ranking, relevance-generation prompting, attention sorting (Peysakhovich &amp; Lerer 2023), prompt reordering, and LongLLMLingua-rk.</li>
<li class="">The method was evaluated on NaturalQuestions (2,655 real queries over Wikipedia) and SynthWiki (990 synthetic GPT-4-generated entries).</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-holds-up--and-what-doesnt">What holds up — and what doesn't<a href="https://beancount.io/bean-labs/research-logs/2026/07/02/found-in-the-middle-calibrating-positional-attention-bias#what-holds-up--and-what-doesnt" class="hash-link" aria-label="Direct link to What holds up — and what doesn't" title="Direct link to What holds up — and what doesn't" translate="no">​</a></h2>
<p>The core result is striking and I believe it. A 20.52%→68.32% Recall@3 gap for mid-context gold documents is not the kind of number that evaporates under scrutiny — it's measuring something real about how attention is distributed. The training-free design is a genuine practical advantage: you can drop this on top of any existing RAG pipeline without touching the model weights.</p>
<p>That said, I have some reservations. First, the "dummy document" approach assumes that positional bias is roughly position-separable and additive — a linear decomposition that the authors themselves flag as potentially oversimplifying. Real attention bias may interact with content in non-linear ways. Second, the O(K) extra forward passes are priced as "acceptable" but never benchmarked for latency or cost. In a production system with K=20 retrievals, you're running 21 forward passes instead of 1 per query. For a Beancount agent triaging hundreds of transactions, this multiplier matters.</p>
<p>Third — and this is the most interesting limitation — the authors note that positional bias might actually be useful for certain tasks. Recency bias, for instance, might be what makes a model weight recent ledger entries correctly over older ones. Removing bias indiscriminately could hurt tasks where position is a valid signal. This is acknowledged but not studied.</p>
<p>Finally, the experiments use NaturalQuestions and a synthetic dataset. Finance-specific documents — dense tables, multi-year filings, ledger entries with repetitive structure — are very different from open-domain Wikipedia passages. The calibration would need to be validated on those distributions before claiming it will work for financial RAG.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-this-matters-for-finance-ai">Why this matters for finance AI<a href="https://beancount.io/bean-labs/research-logs/2026/07/02/found-in-the-middle-calibrating-positional-attention-bias#why-this-matters-for-finance-ai" class="hash-link" aria-label="Direct link to Why this matters for finance AI" title="Direct link to Why this matters for finance AI" translate="no">​</a></h2>
<p>The direct connection is clear: every log since DocFinQA has been circling the same problem. When a Beancount agent retrieves 20 relevant ledger entries to answer a question like "reconcile March against the bank statement," entries from the middle of the retrieved window will be systematically underattended relative to entries at the top and bottom of the context. That's not a retrieval failure — it's a generation-side failure that no amount of retrieval-ranking improvement will fix.</p>
<p>The found-in-the-middle calibration is a plausible mitigation that requires no retraining of the underlying model and could be applied directly inside the generation step of any ledger QA pipeline. The O(K) cost concern is real but manageable — a 20-document retrieval window with a moderately sized model is still well within practical bounds. What I'd want to see before deploying it is a validation on Beancount-structured data specifically: does the positional correction help uniformly, or does it inadvertently suppress the recency signal that makes recent transactions more trustworthy than old ones?</p>
<p>The broader principle — that attention mechanisms encode positional priors independently of content relevance, and that those priors can be calibrated away without retraining — is one worth keeping. It opens the door to similar calibrations for other biases: token-frequency bias, input-length normalization, verbosity bias in generation.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-to-read-next">What to read next<a href="https://beancount.io/bean-labs/research-logs/2026/07/02/found-in-the-middle-calibrating-positional-attention-bias#what-to-read-next" class="hash-link" aria-label="Direct link to What to read next" title="Direct link to What to read next" translate="no">​</a></h2>
<ul>
<li class="">"Mitigate Position Bias in LLMs via Scaling a Single Hidden States Channel" (arXiv:2406.02536, ACL Findings 2025) — proposes scaling a single hidden-state dimension rather than subtracting attention scores; worth comparing to found-in-the-middle's approach directly.</li>
<li class="">"Large Language Models for Anomaly and Out-of-Distribution Detection: A Survey" (arXiv:2409.01980, NAACL 2025) — next on the reading list; ties together the AnoLLM, CausalTAD, and AD-LLM thread into a unified taxonomy.</li>
<li class="">Liu et al., "Lost in the Middle: How Language Models Use Long Contexts" (arXiv:2307.03172, TACL 2023) — the original diagnosis that found-in-the-middle is responding to; essential background reading.</li>
</ul>]]></content>
        <author>
            <name>Mike Thrift</name>
        </author>
        <category label="AI" term="AI"/>
        <category label="LLM" term="LLM"/>
        <category label="Machine Learning" term="Machine Learning"/>
        <category label="Data Science" term="Data Science"/>
        <category label="Automation" term="Automation"/>
        <category label="Beancount" term="Beancount"/>
        <category label="Reconciliation" term="Reconciliation"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Uncertainty-Aware Deferral for LLM Agents: When to Escalate from Small to Large Models]]></title>
        <id>https://beancount.io/bean-labs/research-logs/2026/07/01/redact-uncertainty-aware-deferral-llm-agents</id>
        <link href="https://beancount.io/bean-labs/research-logs/2026/07/01/redact-uncertainty-aware-deferral-llm-agents"/>
        <updated>2026-07-01T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[ReDAct runs a small model by default and escalates to an expensive model only when token-level perplexity signals uncertainty, achieving 64% cost savings over GPT-5.2-only while matching or exceeding its accuracy — a directly applicable pattern for Beancount transaction-categorization agents.]]></summary>
        <content type="html"><![CDATA[<p>The pressure on autonomous agents to be both cheap and reliable pulls in opposite directions: frontier models are reliable but expensive, small models are cheap but error-prone. Piatrashyn et al.'s ReDAct paper (arXiv:2604.07036) proposes a middle path — run a small model by default and defer to a large model only when the small model is uncertain. I'm reading it because the same tension defines every production Beancount write-back agent: you want the system to handle routine categorization cheaply and to escalate non-obvious cases before they corrupt the ledger.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-paper">The paper<a href="https://beancount.io/bean-labs/research-logs/2026/07/01/redact-uncertainty-aware-deferral-llm-agents#the-paper" class="hash-link" aria-label="Direct link to The paper" title="Direct link to The paper" translate="no">​</a></h2>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-beancount-io?title=Uncertainty-Aware%20Deferral%20for%20LLM%20Agents%3A%20When%20to%20Escalate%20from%20Small%20to%20Large%20Models" alt="2026-07-01-redact-uncertainty-aware-deferral-llm-agents" class="img_ev3q"></p>
<p>ReDAct (Reason-Defer-Act) builds on the ReAct prompting paradigm and introduces a two-model agent architecture. A small cheap model — Qwen3-80B, Llama3.3-70B, or Llama4-Maverick — handles every step by default. At each step it generates a reasoning trace, then generates an action. The system measures token-level uncertainty over the <em>action generation step only</em> and compares it against a calibrated threshold. If uncertainty exceeds that threshold, the step is re-run by a large expensive model (GPT-5.2, Qwen3-235B, or Qwen3-480B); otherwise the small model's action is executed.</p>
<p>The uncertainty measures are information-theoretic and require only token-level log-probabilities: Sequence Probability (summed negative log-prob), Perplexity (length-normalized), and Mean Token Entropy (average entropy across token positions). The threshold is calibrated from a held-out set of small-model rollouts by choosing the value that produces a target number K of large-model calls per episode.</p>
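<p>The deferral rule itself is simple enough to sketch. Assuming access to per-token log-probabilities for the small model's generated action string (names and threshold below are illustrative, not taken from the paper):</p>

```python
import math

# Hedged sketch of perplexity-based deferral. The only input is the list of
# per-token log-probabilities for the small model's *action* string.
def perplexity(token_logprobs: list) -> float:
    """Length-normalized perplexity: exp of the negative mean log-prob."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def should_defer(token_logprobs: list, threshold: float) -> bool:
    """Escalate to the large model when action perplexity exceeds a
    threshold calibrated offline from held-out small-model rollouts."""
    return perplexity(token_logprobs) > threshold

confident_action = [-0.05, -0.10, -0.02]  # high-probability action tokens
uncertain_action = [-2.3, -1.9, -2.7]     # low-probability action tokens
```

<p>Sequence Probability and Mean Token Entropy slot into the same decision rule; only the scoring function changes.</p>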
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-ideas">Key ideas<a href="https://beancount.io/bean-labs/research-logs/2026/07/01/redact-uncertainty-aware-deferral-llm-agents#key-ideas" class="hash-link" aria-label="Direct link to Key ideas" title="Direct link to Key ideas" translate="no">​</a></h2>
<ul>
<li class=""><strong>Measure uncertainty at the action step, not the reasoning step.</strong> An auxiliary experiment on 2,411 ALFWorld steps found that reasoning-level uncertainty has poor discriminative power between correct and incorrect steps; action-level perplexity has measurably higher ROC-AUC and PRR as a predictor of correctness.</li>
<li class=""><strong>PPL deferral with Qwen3-80B + GPT-5.2 achieves 80.8% ± 1.1% on ALFWorld</strong>, exceeding GPT-5.2 alone at 78.3% ± 1.9%, while costing $16.25 vs $45.21 — approximately 64% cheaper.</li>
<li class=""><strong>~15% of steps are deferred</strong> in practice, against a calibration target of roughly 10%; the gap arises because failed (shorter) trajectories contribute disproportionately to the deferral budget.</li>
<li class=""><strong>Random deferral at the same rate scores 77.0%</strong> — still better than small-model-only (68.3%), but worse than UQ-guided deferral. The uncertainty signal genuinely matters, not just the act of calling the large model more.</li>
<li class=""><strong>MiniGrid shows less headroom.</strong> Qwen3-80B + GPT-5.2 with PPL deferral reaches 95.0% vs 99.0% for GPT-5.2 alone. The smaller task vocabulary creates a harder ceiling for the deferral approach when the small model is structurally inadequate.</li>
<li class=""><strong>Deferral distribution is task-dependent.</strong> ALFWorld defers more in later steps (longer prompt history), while MiniGrid shows a bimodal pattern tied to initial agent position. This means fixed threshold calibration generalises better within a task family than across task families.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-holds-up--and-what-doesnt">What holds up — and what doesn't<a href="https://beancount.io/bean-labs/research-logs/2026/07/01/redact-uncertainty-aware-deferral-llm-agents#what-holds-up--and-what-doesnt" class="hash-link" aria-label="Direct link to What holds up — and what doesn't" title="Direct link to What holds up — and what doesn't" translate="no">​</a></h2>
<p>The core empirical finding is credible: perplexity over the action string is a reasonable proxy for whether a given step is about to go wrong. The reasoning/acting decomposition in ReAct naturally provides a clean point to attach an uncertainty signal, and the auxiliary correctness-prediction experiment gives genuine mechanistic justification for the design choice.</p>
<p>What I'm less convinced by: the "exceeds large-model-alone" result on ALFWorld. 80.8% ± 1.1% vs 78.3% ± 1.9% overlap at one standard deviation. The authors attribute this to complementary strengths — the small model handles routine steps without the large model's occasional risk-taking — but there is no per-step ablation to verify this narrative. It could just as easily be noise.</p>
<p>The benchmark choice is also limiting. ALFWorld and MiniGrid are text-based household simulation and grid-world navigation — narrow environments that do not exercise tool calling, code execution, or multi-document retrieval. Whether uncertainty-calibrated deferral holds in those richer settings (the settings relevant to Beancount) is unanswered. And the choice of GPT-5.2 as the large model makes the cost numbers hard to reproduce.</p>
<p>The calibration procedure has an unaddressed circularity: the threshold is selected on the same distribution it was calibrated on, with no held-out validation. The authors acknowledge distribution shift between calibration (small-model rollouts) and evaluation (hybrid rollouts), but leave threshold robustness to future work.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-this-matters-for-finance-ai">Why this matters for finance AI<a href="https://beancount.io/bean-labs/research-logs/2026/07/01/redact-uncertainty-aware-deferral-llm-agents#why-this-matters-for-finance-ai" class="hash-link" aria-label="Direct link to Why this matters for finance AI" title="Direct link to Why this matters for finance AI" translate="no">​</a></h2>
<p>Beancount write-back agents face exactly the same deferral question at every transaction. A routine grocery purchase needs categorisation; an unusual multi-leg foreign-currency swap with a partially matched memo needs a human. The current practice is either full automation (risky) or full human review (expensive). ReDAct's framework suggests a tractable middle ground: run the cheap model and escalate when perplexity over the candidate journal entry exceeds a calibrated threshold.</p>
<p>The finance context adds two considerations the paper doesn't address. First, deferral here should often mean <em>pausing and asking the user</em>, not calling a larger LLM — the ledger's correctness standard is the user's intent, not a benchmark score. Second, the irreversibility of a committed Beancount entry is higher than a misplaced object in ALFWorld. The calibration target K should probably be tuned conservatively toward lower precision on the small model before deferring, not the other way around.</p>
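<p>A conservative adaptation might route each candidate entry three ways rather than two. The thresholds and names below are hypothetical, not from the paper; they simply make the pausing-and-asking variant concrete.</p>

```python
# Hypothetical three-way routing for a write-back agent. Perplexity is
# computed over the candidate journal entry; thresholds are illustrative.
def route(entry_perplexity: float, low: float = 1.2, high: float = 3.0) -> str:
    if entry_perplexity <= low:
        return "auto-commit"      # routine: small model commits the entry
    if entry_perplexity <= high:
        return "escalate-to-llm"  # ambiguous: re-run with the large model
    return "ask-user"             # high uncertainty: pause for human review
```

<p>Tuning <code>low</code> downward trades cost for safety, matching the conservative calibration argued for above.</p>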
<p>The 64% cost reduction signal is worth taking seriously even with those caveats. If a Beancount agent processes a month of transactions and only 15% of categorisation decisions need the expensive model, the economics of running a capable write-back agent look much better.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-to-read-next">What to read next<a href="https://beancount.io/bean-labs/research-logs/2026/07/01/redact-uncertainty-aware-deferral-llm-agents#what-to-read-next" class="hash-link" aria-label="Direct link to What to read next" title="Direct link to What to read next" translate="no">​</a></h2>
<ul>
<li class=""><strong>KnowNo</strong> (Ren et al., 2023, CoRL): "Robots that ask for help: uncertainty alignment for large language model planners" — uses conformal prediction to calibrate a <em>coverage</em> guarantee on when to ask for help. ReDAct does not compare against it; understanding the trade-off between conformal guarantees and threshold calibration matters before choosing a production approach. [arXiv:2307.01928]</li>
<li class=""><strong>A Survey of Confidence Estimation and Calibration in Large Language Models</strong> (Guo et al., NAACL 2024) — systematic taxonomy of verbalized confidence, sampling-based, and post-hoc calibration methods; the theoretical background for deciding whether perplexity is the right uncertainty proxy or whether calibrated logit scaling would perform better. [arXiv:2311.08298]</li>
<li class=""><strong>UALA: Uncertainty-Aware Language Agent</strong> (Han, Buntine, Shareghi) — applies a structurally similar uncertainty threshold to the <em>tool invocation</em> decision (call a tool vs rely on model knowledge), reducing tool calls by over 50%; the direct complement to ReDAct for the tool-use axis of agent uncertainty. [<a href="https://uala-agent.github.io/" target="_blank" rel="noopener noreferrer" class="">https://uala-agent.github.io/</a>]</li>
</ul>]]></content>
        <author>
            <name>Mike Thrift</name>
        </author>
        <category label="AI" term="AI"/>
        <category label="LLM" term="LLM"/>
        <category label="Automation" term="Automation"/>
        <category label="Machine Learning" term="Machine Learning"/>
        <category label="Beancount" term="Beancount"/>
        <category label="Decision-making" term="Decision-making"/>
        <category label="Plain-Text Accounting" term="Plain-Text Accounting"/>
        <category label="Trust" term="Trust"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[OpenHands: Open Platform for AI Software Agents and What It Means for Finance Automation]]></title>
        <id>https://beancount.io/bean-labs/research-logs/2026/06/30/openhands-open-platform-ai-software-developers-generalist-agents</id>
        <link href="https://beancount.io/bean-labs/research-logs/2026/06/30/openhands-open-platform-ai-software-developers-generalist-agents"/>
        <updated>2026-06-30T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[OpenHands is an MIT-licensed, Docker-sandboxed agent platform where CodeAct achieves 26% on SWE-Bench Lite — a sobering benchmark that establishes what AI agents can reliably do today, and why the first productive finance deployments should be tightly scoped rather than autonomous.]]></summary>
<content type="html"><![CDATA[<p>I keep encountering OpenHands as the scaffolding layer beneath TheAgentCompany, InvestorBench, and a growing list of evaluation papers — yet until now I had not read the primary paper. This is the infrastructure that the rest of the field is quietly building on, so understanding what it actually provides, and where it falls short, matters more than any single benchmark result built on top of it.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-paper">The paper<a href="https://beancount.io/bean-labs/research-logs/2026/06/30/openhands-open-platform-ai-software-developers-generalist-agents#the-paper" class="hash-link" aria-label="Direct link to The paper" title="Direct link to The paper" translate="no">​</a></h2>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-beancount-io?title=OpenHands%3A%20Open%20Platform%20for%20AI%20Software%20Agents%20and%20What%20It%20Means%20for%20Finance%20Automation" alt="2026-06-30-openhands-open-platform-ai-software-developers-generalist-agents" class="img_ev3q"></p>
<p>OpenHands (Wang et al., 2024; ICLR 2025) is an open-source platform for building and evaluating LLM agents that act as generalist software developers. Led by Xingyao Wang and Graham Neubig across a 24-author team, the paper's core claim is that most existing agent frameworks are either too research-narrow (hard-coded task loops) or too production-narrow (closed-source or single-purpose) to serve as a shared foundation for the research community. OpenHands tries to fix that by providing a standardized runtime, a clean agent abstraction, and 15 integrated evaluation benchmarks under one MIT-licensed repo.</p>
<p>The runtime is a Docker-sandboxed environment containing a bash shell, a Jupyter IPython server, and a Playwright-controlled Chromium browser. Agents interact via three primary action types: <code>IPythonRunCellAction</code> for Python, <code>CmdRunAction</code> for shell commands, and <code>BrowserInteractiveAction</code> for web navigation. A multi-agent coordination primitive, <code>AgentDelegateAction</code>, lets a main agent spawn specialized sub-agents. The default backbone is CodeAct — originally published as a standalone paper arguing that code is the ideal unified action space for LLM agents — and the platform ships several agent implementations including a general CodeActAgent and a specialized BrowsingAgent.</p>
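<p>As a toy illustration of that action abstraction (the class names mirror the paper's action types, but the real platform's interfaces and runtime protocol differ), a minimal dispatch loop might look like:</p>

```python
from dataclasses import dataclass
import subprocess

# Hypothetical sketch of OpenHands-style action dispatch. Class names echo
# the paper's action types; everything else is illustrative.

@dataclass
class CmdRunAction:
    command: str

@dataclass
class IPythonRunCellAction:
    code: str

class MiniRuntime:
    """Toy stand-in for the Docker-sandboxed runtime: dispatch on action type."""

    def step(self, action):
        if isinstance(action, CmdRunAction):
            # The real runtime executes this inside an isolated container.
            result = subprocess.run(
                action.command, shell=True, capture_output=True, text=True
            )
            return result.stdout.strip()
        if isinstance(action, IPythonRunCellAction):
            # The real runtime talks to a Jupyter server; exec() is a local toy.
            scope = {}
            exec(action.code, scope)
            return scope.get("_result")
        raise TypeError(f"unsupported action: {action!r}")

runtime = MiniRuntime()
shell_out = runtime.step(CmdRunAction("echo hello"))
py_out = runtime.step(IPythonRunCellAction("_result = 2 + 2"))
```

<p>The point of the uniform <code>step(action)</code> interface is that a single agent loop can interleave shell, Python, and (in the real platform) browser actions without caring which backend executes them.</p>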
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-ideas">Key ideas<a href="https://beancount.io/bean-labs/research-logs/2026/06/30/openhands-open-platform-ai-software-developers-generalist-agents#key-ideas" class="hash-link" aria-label="Direct link to Key ideas" title="Direct link to Key ideas" translate="no">​</a></h2>
<ul>
<li class=""><strong>Code as universal action space</strong>: CodeAct consolidates all agent actions (file edits, API calls, data transformations) into Python or bash, letting the LLM reason in the same medium it was trained on most heavily. This sidesteps the JSON-schema brittleness that plagues function-calling agents.</li>
<li class=""><strong>Sandboxed Docker runtime</strong>: every agent runs in an isolated container, so agents can freely execute arbitrary code without compromising the host machine — a prerequisite for any production finance agent that might be handed real credentials.</li>
<li class=""><strong>15 benchmarks in one harness</strong>: SWE-Bench Lite (code repair), HumanEvalFix (bug fixing), WebArena (web navigation), GPQA (graduate-level reasoning), GAIA (general task-solving), and ten more. Having these colocated prevents cherry-picked evaluation.</li>
<li class=""><strong>CodeActAgent + claude-3.5-sonnet achieves 26% on SWE-Bench Lite</strong> and 79.3% on HumanEvalFix; BrowsingAgent reaches 15.5% on WebArena — competitive zero-shot without any task-specific training.</li>
<li class=""><strong>GAIA performance</strong>: 32.1% with GPTSwarm, well below the 92% human baseline — consistent with every other general agent benchmark showing a 60–70 point human-agent gap.</li>
<li class=""><strong>Community scale</strong>: 71.4K GitHub stars and 188+ contributors at the time of ICLR submission; TheAgentCompany adopted OpenHands as its evaluation harness, lending it de facto benchmark-infrastructure status.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-holds-up--and-what-doesnt">What holds up — and what doesn't<a href="https://beancount.io/bean-labs/research-logs/2026/06/30/openhands-open-platform-ai-software-developers-generalist-agents#what-holds-up--and-what-doesnt" class="hash-link" aria-label="Direct link to What holds up — and what doesn't" title="Direct link to What holds up — and what doesn't" translate="no">​</a></h2>
<p>The sandboxed runtime design is solid engineering. Isolating agent execution in Docker is the correct default for any system that might later be given write access to real financial ledgers, and it is genuinely useful that the benchmarks are co-located rather than scattered across incompatible repos.</p>
<p>The benchmark coverage, however, is more aspirational than systematic. The 15 benchmarks span wildly different task types and difficulty levels without a clear framework for how results should be aggregated or compared. Reporting 26% on SWE-Bench Lite alongside 79.3% on HumanEvalFix in the same paper risks creating the impression that the same agent is simultaneously mediocre and excellent — the tasks are simply not comparable. The authors do not provide a principled multi-benchmark aggregation methodology.</p>
<p>The CodeAct assumption — that code is the right universal action format — is contested. It works well for development tasks but imposes a Python/bash mediation layer on every action, which adds latency and breaks when the action semantics do not map cleanly to code (ambiguous user instructions, natural-language-only APIs). The paper does not benchmark against non-code action spaces to demonstrate that the advantage is real rather than confounded by the LLM backbone.</p>
<p>Perhaps the most important gap is the evaluation-versus-deployment split. The 26% SWE-Bench number comes from a relatively clean, well-specified benchmark. Community reports and GitHub issue threads consistently describe much lower reliability on ambiguous or long-horizon real-world tasks — the same failure mode TheAgentCompany documented. The paper does not address how to measure or improve robustness under realistic task specification noise.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-this-matters-for-finance-ai">Why this matters for finance AI<a href="https://beancount.io/bean-labs/research-logs/2026/06/30/openhands-open-platform-ai-software-developers-generalist-agents#why-this-matters-for-finance-ai" class="hash-link" aria-label="Direct link to Why this matters for finance AI" title="Direct link to Why this matters for finance AI" translate="no">​</a></h2>
<p>OpenHands is the closest thing the community has to a shared agent substrate. If Bean Labs builds evaluation infrastructure for Beancount agents, the runtime architecture here — Docker sandbox, Python/bash actions, pluggable LLM backends — is worth adopting rather than rebuilding. The <code>AgentDelegateAction</code> primitive maps naturally to a finance agent pipeline where a top-level orchestrator delegates to specialized sub-agents: one for ledger reads, one for anomaly flagging, one for proposed write-back that a human reviews.</p>
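<p>A minimal sketch of that delegation pattern, with illustrative sub-agents and toy data rather than anything from OpenHands itself:</p>

```python
# Hypothetical finance pipeline in the shape of AgentDelegateAction: a
# top-level orchestrator routes sub-tasks to narrowly scoped sub-agents.
# All names and data here are illustrative, not OpenHands APIs.

def ledger_reader(task):
    # Read-only sub-agent: fetch transactions matching the request.
    return [{"payee": "Amazon", "amount": -42.50}]

def anomaly_flagger(task):
    # Flags entries for review; never writes to the ledger itself.
    return [t for t in task["entries"] if abs(t["amount"]) > 40]

def orchestrator(user_request):
    entries = ledger_reader({"query": user_request})
    flagged = anomaly_flagger({"entries": entries})
    # Proposed write-backs go to a human queue, not straight to the ledger.
    return {"proposals": flagged, "requires_human_review": True}

result = orchestrator("review June card spend")
```

<p>The design choice worth copying is the asymmetry: read paths are delegated freely, while every write path terminates in a human-review queue.</p>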
<p>The SWE-Bench and TheAgentCompany numbers, read together, establish a sobering prior: even the best available agents complete roughly 26–30% of realistic, unambiguous software tasks. Financial ledger automation is harder — transactions are often ambiguous, the blast radius of errors is real, and user intent is frequently underspecified. The right inference is not that agents are not ready, but that the first productive deployments will be tightly scoped write-once workflows (categorization suggestions, reconciliation flagging) rather than autonomous multi-step ledger edits.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-to-read-next">What to read next<a href="https://beancount.io/bean-labs/research-logs/2026/06/30/openhands-open-platform-ai-software-developers-generalist-agents#what-to-read-next" class="hash-link" aria-label="Direct link to What to read next" title="Direct link to What to read next" translate="no">​</a></h2>
<ul>
<li class=""><strong>ReDAct: Uncertainty-Aware Deferral for LLM Agents</strong> (arXiv:2604.07036) — pairs a cheap model with an expensive one and defers to the expensive model only when uncertainty is high; directly addresses how an OpenHands-style agent should decide when to escalate a Beancount write-back to human review.</li>
<li class=""><strong>FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks</strong> (arXiv:2604.10015) — 800 expert-annotated task sequences across 34 financial scenarios; the evaluation methodology OpenHands lacks for finance-specific long-horizon tool use.</li>
<li class=""><strong>FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol</strong> (arXiv:2603.24943) — 613 samples across 65 real MCP financial tools, directly relevant to how a Beancount agent built on OpenHands's runtime would be evaluated in a real MCP deployment.</li>
</ul>]]></content>
        <author>
            <name>Mike Thrift</name>
        </author>
        <category label="AI" term="AI"/>
        <category label="Open Source" term="Open Source"/>
        <category label="Automation" term="Automation"/>
        <category label="LLM" term="LLM"/>
        <category label="Developers" term="Developers"/>
        <category label="Beancount" term="Beancount"/>
        <category label="Plain-Text Accounting" term="Plain-Text Accounting"/>
        <category label="Machine Learning" term="Machine Learning"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Fin-RATE: How LLMs Fail at Cross-Period and Cross-Entity Financial Analysis]]></title>
        <id>https://beancount.io/bean-labs/research-logs/2026/06/29/fin-rate-real-world-financial-analytics-tracking-evaluation-benchmark</id>
        <link href="https://beancount.io/bean-labs/research-logs/2026/06/29/fin-rate-real-world-financial-analytics-tracking-evaluation-benchmark"/>
        <updated>2026-06-29T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Fin-RATE benchmarks 17 LLMs on 7,500 expert-curated QA pairs from 2,472 SEC filings, revealing an 18.60% accuracy collapse under longitudinal tracking and a 54-point drop for finance-specialized Fin-R1 on cross-entity tasks — with the retrieval pipeline, not the backbone model, as the binding bottleneck.]]></summary>
        <content type="html"><![CDATA[<p>The trajectory of financial LLM benchmarks keeps expanding scope, and Fin-RATE is the clearest example yet of what happens when we finally ask models to do what real analysts do: track a company not just within one filing, but across multiple periods and against its industry peers.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-paper">The paper<a href="https://beancount.io/bean-labs/research-logs/2026/06/29/fin-rate-real-world-financial-analytics-tracking-evaluation-benchmark#the-paper" class="hash-link" aria-label="Direct link to The paper" title="Direct link to The paper" translate="no">​</a></h2>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-beancount-io?title=Fin-RATE%3A%20How%20LLMs%20Fail%20at%20Cross-Period%20and%20Cross-Entity%20Financial%20Analysis" alt="2026-06-29-fin-rate-real-world-financial-analytics-tracking-evaluation-benchmark" class="img_ev3q"></p>
<p>Fin-RATE, published in February 2026 by Yidong Jiang, Junrong Chen, and colleagues at Yale and collaborating institutions, introduces a benchmark built from 2,472 SEC filings across 43 companies and 36 industries spanning 2020–2025. The benchmark organizes 7,500 expert-curated QA pairs into three task types that mirror professional analyst workflows: DR-QA (detail and reasoning within a single filing), EC-QA (cross-entity comparison of two companies under a shared topic), and LT-QA (longitudinal tracking of the same firm across reporting periods). Each task type contains 2,500 questions. The evaluation spans 17 LLMs—closed-source models including GPT-4.1 and GPT-5, open-source general models like DeepSeek-V3 and Llama-3.3-70B, and finance-specialized models like Fin-R1, Fino1-14B, FinanceConnect-13B, and TouchstoneGPT-7B. Scoring uses a unified LLM-as-Judge framework with three independent judges (GPT-5, DeepSeek-V3.2, Qwen3-235B) rating each response on correctness and five analytic dimensions.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-ideas">Key ideas<a href="https://beancount.io/bean-labs/research-logs/2026/06/29/fin-rate-real-world-financial-analytics-tracking-evaluation-benchmark#key-ideas" class="hash-link" aria-label="Direct link to Key ideas" title="Direct link to Key ideas" translate="no">​</a></h2>
<ul>
<li class="">Performance collapses as task complexity grows: accuracy drops 18.60% from single-document DR-QA to longitudinal LT-QA and 14.35% from DR-QA to cross-entity EC-QA, averaged across all 17 models.</li>
<li class="">GPT-5 with web search is the top performer, yet its peak accuracy sits at only 43–44% across all three task types—dismal for a benchmark meant to mirror real analyst workflows.</li>
<li class="">Fin-R1, the finance-specialized reasoning model, reaches 57.48% on DR-QA but collapses to 3.32% on EC-QA—a 54-point fall that far exceeds any general model's degradation.</li>
<li class="">Under RAG settings, performance across all models falls well below 27%, compared to gold-context performance of up to 57.48%; the retrieval pipeline, not the LLM, is the binding bottleneck.</li>
<li class="">The paper introduces a 13-type error taxonomy across four categories: hallucination and contradictions, finance-specific numerical and semantic errors, query/context understanding errors, and retrieval-level failures. Missing Evidence accounts for 75.44% of errors on the EC-QA task under RAG.</li>
<li class="">Finance-specialized models show systematically higher hallucination rates than general models on complex tasks, despite better financial terminology.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-holds-up--and-what-doesnt">What holds up — and what doesn't<a href="https://beancount.io/bean-labs/research-logs/2026/06/29/fin-rate-real-world-financial-analytics-tracking-evaluation-benchmark#what-holds-up--and-what-doesnt" class="hash-link" aria-label="Direct link to What holds up — and what doesn't" title="Direct link to What holds up — and what doesn't" translate="no">​</a></h2>
<p>The three-pathway structure is genuinely well-designed. Most financial benchmarks (FinQA, TAT-QA, FinanceBench) treat QA as a single-document task. Fin-RATE is one of the first to explicitly model cross-entity comparison and longitudinal tracking as first-class tasks, and the results expose a fundamental gap: current LLMs handle isolated disclosure QA tolerably but fall apart the moment they need to synthesize across documents, entities, or time periods.</p>
<p>The Fin-R1 collapse is the paper's most striking finding and I think it is underappreciated. A finance-tuned model that excels at single-document extraction apparently trained itself into a corner: it learned templates for answering within one document, not reasoning strategies for relating entities and time periods. This is a concrete warning against narrow domain fine-tuning without explicit multi-document reasoning supervision. The model likely overfits to the shallow pattern of "find the number in the filing" and has no generalization path to "compare this number to the equivalent number in another filing from another company."</p>
<p>That said, there are methodology concerns worth flagging. GPT-5 is simultaneously one of the models being evaluated and one of the three judges scoring answers. The authors use three judges to reduce individual bias, which helps, but the judge-model overlap with the strongest evaluated model is uncomfortable. The paper reports high inter-judge agreement but does not separately quantify what fraction of GPT-5 responses GPT-5 itself scored, nor whether GPT-5's self-assessed scores differ systematically from the other two judges. Any self-evaluation bias would inflate the top-line result for the best-performing model in the study.</p>
<p>The 43-company sample is also thin. The filing type coverage is commendably broad (10-K, 10-Q, 8-K, 6-K, DEF 14A, and several S and SC series), but the same 43 companies appear across all tasks. Models that have seen these companies' disclosures in pre-training have an unquantified advantage, and the paper does not include any contamination analysis.</p>
<p>The retrieval finding is important but incomplete. The paper identifies that RAG performance collapses by roughly 30 points versus gold context because retrieval fails. But it benchmarks only a single retrieval setup—it treats retrieval failure as a diagnosis rather than something to systematically vary. A follow-on paper that sweeps retrieval architectures on Fin-RATE would be far more actionable.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-this-matters-for-finance-ai">Why this matters for finance AI<a href="https://beancount.io/bean-labs/research-logs/2026/06/29/fin-rate-real-world-financial-analytics-tracking-evaluation-benchmark#why-this-matters-for-finance-ai" class="hash-link" aria-label="Direct link to Why this matters for finance AI" title="Direct link to Why this matters for finance AI" translate="no">​</a></h2>
<p>Beancount ledger audit needs exactly the two capabilities Fin-RATE reveals are broken: longitudinal tracking (how did this account evolve over fiscal years?) and cross-entity comparison (does this subsidiary's balance sheet reconcile against the consolidated statement?). The 18.60% accuracy drop under temporal tracking is a concrete number that should calibrate expectations for any Beancount agent reasoning across multiple reporting periods. If frontier models fail at 43% under gold-context longitudinal SEC QA, a Beancount agent navigating multi-year ledger histories should be designed with explicit retrieval, temporal grounding, and human escalation—not end-to-end LLM inference.</p>
<p>The retrieval dominance finding matters most for system design priority. If gold-context performance is nearly double RAG performance, the right investment is in better chunking, passage selection, and retrieval—not a more capable backbone LLM. This mirrors what DocFinQA found for long-context SEC filings: the pipeline around the model is the bottleneck.</p>
<p>The Fin-R1 warning also applies directly to the Beancount use case. Fine-tuning on Beancount DSL syntax and transaction patterns may produce a model that handles simple entry generation well but breaks under the multi-account, multi-period reconciliation that makes audit useful. Specialization without multi-document reasoning training is fragile in exactly the ways Fin-RATE measures.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-to-read-next">What to read next<a href="https://beancount.io/bean-labs/research-logs/2026/06/29/fin-rate-real-world-financial-analytics-tracking-evaluation-benchmark#what-to-read-next" class="hash-link" aria-label="Direct link to What to read next" title="Direct link to What to read next" translate="no">​</a></h2>
<ul>
<li class="">Fin-R1 (arXiv:2503.16252) — to understand what training setup produced such brittle cross-document performance, and whether multi-document reasoning was ever in scope.</li>
<li class="">FinTrace (arXiv:2604.10015) — trajectory-level evaluation of LLM tool calling across 34 financial task categories; complements Fin-RATE's static QA view with a process-level diagnostic of where models invoke the right tools but fail to reason over the results.</li>
<li class="">OpenHands (arXiv:2407.16741) — the open agent platform underlying TheAgentCompany evaluations; understanding its architecture clarifies which baseline agent capabilities were available and which gaps are attributable to task difficulty rather than platform limitations.</li>
</ul>]]></content>
        <author>
            <name>Mike Thrift</name>
        </author>
        <category label="LLM" term="LLM"/>
        <category label="AI" term="AI"/>
        <category label="Machine Learning" term="Machine Learning"/>
        <category label="Analytics" term="Analytics"/>
        <category label="Financial Reporting" term="Financial Reporting"/>
        <category label="Data Science" term="Data Science"/>
        <category label="Reconciliation" term="Reconciliation"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[FinDER: Real Analyst Queries Expose a 74% Recall Gap in Financial RAG]]></title>
        <id>https://beancount.io/bean-labs/research-logs/2026/06/28/finder-financial-dataset-rag-evaluation</id>
        <link href="https://beancount.io/bean-labs/research-logs/2026/06/28/finder-financial-dataset-rag-evaluation"/>
        <updated>2026-06-28T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[FinDER benchmarks RAG on 5,703 real hedge fund analyst queries against S&P 500 10-K filings; E5-Mistral achieves only 25.95% context recall, and abbreviation-heavy queries cost 8.2 precision points — evidence that query normalization, not better embeddings, is the first fix for finance AI pipelines.]]></summary>
        <content type="html"><![CDATA[<p>FinDER (arXiv:2504.15800) is a retrieval benchmark built around a simple but underappreciated observation: the queries real financial professionals type look nothing like the polished questions in academic benchmarks. I'm reading it because it sits at the intersection of two threads I've been tracking — the retrieval gap in finance AI, and the practical realism problem that DocFinQA and FinanceBench started to expose.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-paper">The paper<a href="https://beancount.io/bean-labs/research-logs/2026/06/28/finder-financial-dataset-rag-evaluation#the-paper" class="hash-link" aria-label="Direct link to The paper" title="Direct link to The paper" translate="no">​</a></h2>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-beancount-io?title=FinDER%3A%20Real%20Analyst%20Queries%20Expose%20a%2074%25%20Recall%20Gap%20in%20Financial%20RAG" alt="2026-06-28-finder-financial-dataset-rag-evaluation" class="img_ev3q"></p>
<p>Chanyeol Choi, Jihoon Kwon, and colleagues at a financial AI firm present a dataset of 5,703 expert-annotated query–evidence–answer triplets sourced from a real hedge fund analyst Q&amp;A service. The documents are Form 10-K filings from 490 S&amp;P 500 companies, collected from SEC EDGAR. What distinguishes FinDER from prior benchmarks is the query side: 89.86% of queries contain three or more domain-specific abbreviations or acronyms. Instead of "What is the total revenue of Company X for fiscal year 2023?", a real analyst might type "GOOGL 10-K FY23 revs breakdown by segment." The dataset was published at the ICLR 2025 Workshop on Advances in Financial AI and later appeared at ICAIF 2025.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-ideas">Key ideas<a href="https://beancount.io/bean-labs/research-logs/2026/06/28/finder-financial-dataset-rag-evaluation#key-ideas" class="hash-link" aria-label="Direct link to Key ideas" title="Direct link to Key ideas" translate="no">​</a></h2>
<ul>
<li class=""><strong>Retrieval recall is shockingly low across the board</strong>: E5-Mistral (best dense retriever) achieves only 25.95% context recall overall; BM25 manages 11.68%. The "Financials" category — the one most directly relevant to accounting — is the hardest: 15.84% and 6.42% respectively.</li>
<li class=""><strong>Query ambiguity alone costs 8.2 precision points</strong>: Testing E5-Mistral on 500 queries, the authors compare well-formed paraphrases (33.9 precision) against the real abbreviated queries (25.7 precision). The gap is entirely attributable to abbreviation/acronym handling, not document complexity.</li>
<li class=""><strong>Retrieval quality is the dominant bottleneck for generation</strong>: LLMs with no context score near-zero (9–10% correct); with top-10 retrieved passages they reach 29–34%; with perfect oracle context they jump to 60–68%. That 35-point gap between realistic and oracle conditions is bigger than the gap between open-source and frontier models.</li>
<li class=""><strong>Compositional arithmetic breaks even with good retrieval</strong>: Multi-step calculation tasks (compositional queries) reach only ~20% correctness across all four models — Claude-3.7-Sonnet, GPT-o1, DeepSeek-R1-Distill, and Qwen-QWQ — even with the top-10 retrieved passages. GPT-o1 leads multiplication tasks at 42.90% but falls to 27.78% on division.</li>
<li class=""><strong>LLM reranking adds modest but consistent improvement</strong>: Letting models rerank the top-10 E5-Mistral hits before answering, Claude-3.7-Sonnet achieves F1 of 63.05 and GPT-o1 reaches 62.90. DeepSeek-R1-Distill trails at 60.01, despite strong performance on structured reasoning elsewhere.</li>
<li class=""><strong>Category difficulty is uneven</strong>: Risk queries are easiest to retrieve (E5-Mistral: 33.07 recall); Financials remain hardest (15.84). This correlates with query structure — risk disclosures use natural language prose, financial tables use dense numeric notation.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-holds-up--and-what-doesnt">What holds up — and what doesn't<a href="https://beancount.io/bean-labs/research-logs/2026/06/28/finder-financial-dataset-rag-evaluation#what-holds-up--and-what-doesnt" class="hash-link" aria-label="Direct link to What holds up — and what doesn't" title="Direct link to What holds up — and what doesn't" translate="no">​</a></h2>
<p>The core contribution is solid: this is a real query distribution from working analysts, and the abbreviation problem is genuine. Any benchmark built from Wikipedia or FinQA-style crowdsourcing misses this. The three-tier evaluation structure — no context, realistic retrieval, oracle context — is the right design; it cleanly separates retrieval quality from reasoning quality and shows the residual generation gap (still ~32–34% failure even with perfect context on qualitative questions).</p>
<p>Where the paper is weakest is reproducibility. At the time of publication, the dataset was not publicly available — the authors state they "plan to release it publicly at a later time." This is a significant problem for a workshop paper presenting itself as an evaluation standard. Benchmarks that aren't released are not benchmarks; they're case studies. It has since appeared at ICAIF 2025, so release may have followed, but the arXiv version does not confirm this.</p>
<p>The retrieval evaluation also uses only four single-stage models (BM25, GTE, mE5, E5-Mistral). There is no hybrid retrieval, no query expansion, no HyDE, no rewriting step targeting the abbreviation problem specifically. Given that the authors have precisely characterized the abbreviation gap, it's surprising they don't test the obvious fix: expand the query ("GOOGL" → "Alphabet Inc.") before retrieval. That experiment is absent.</p>
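<p>The absent experiment is cheap to sketch. Assuming a hand-built alias table (the entries below are illustrative, not drawn from FinDER), token-level expansion before retrieval looks like:</p>

```python
# Sketch of the missing baseline: expand domain abbreviations before the
# query ever reaches the retriever. The alias table is an assumption.

ALIASES = {
    "GOOGL": "Alphabet Inc.",
    "revs": "revenue",
    "FY23": "fiscal year 2023",
    "10-K": "annual report (Form 10-K)",
}

def expand_query(query: str) -> str:
    # Token-level substitution; a real system would disambiguate in context
    # (e.g. "CAT" the ticker vs. "CAT" the word) before expanding.
    return " ".join(ALIASES.get(tok, tok) for tok in query.split())

expanded = expand_query("GOOGL 10-K FY23 revs breakdown by segment")
```

<p>Even this naive pass would let the authors separate how much of the 8.2-point precision gap is pure surface-form mismatch versus deeper semantic difficulty.</p>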
<p>The generation results deserve a closer read. The ~9–10% no-context performance is not a useful lower bound — it's essentially zero — but the 60–68% oracle ceiling is more informative than it appears. Even with the correct passage in hand, the best models fail on roughly one-third of qualitative questions and four-fifths of compositional arithmetic. That ceiling matters: it means retrieval alone cannot solve the problem.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-this-matters-for-finance-ai">Why this matters for finance AI<a href="https://beancount.io/bean-labs/research-logs/2026/06/28/finder-financial-dataset-rag-evaluation#why-this-matters-for-finance-ai" class="hash-link" aria-label="Direct link to Why this matters for finance AI" title="Direct link to Why this matters for finance AI" translate="no">​</a></h2>
<p>The query distribution in FinDER maps well onto how Beancount users actually interact with a ledger agent. A user who has been maintaining their accounts for years will type abbreviated, contextual queries — "AMZN card Q3 reimb?" rather than "What are the Amazon credit card reimbursements in Q3?" Standard embedding models will fail to retrieve the right entries because they were trained on clean natural-language text. The 8.2-point precision drop from clean to real queries is probably conservative for a personal ledger domain, where idiosyncratic shorthand ("prop mgmt fee" for "property management fee") is even further from training data than SEC-standard abbreviations.</p>
<p>The 25.95% context recall ceiling on E5-Mistral is a forcing function: any Beancount RAG pipeline needs to budget for a large fraction of missed evidence. One implication is that high-recall re-retrieval (multiple passes, diversified query formulations) matters more than pushing F1 on a single pass. Another is that query normalization — mapping user shorthand to canonical account names before retrieval — should be an explicit preprocessing step, not left to the embedding model.</p>
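<p>A minimal sketch of that normalization layer, with an assumed shorthand table rather than anything learned from data:</p>

```python
# Illustrative preprocessing step mapping user shorthand to canonical
# Beancount account names before embedding or retrieval. The shorthand
# table is hypothetical; in practice it would be seeded from the user's
# own account hierarchy and past queries.

SHORTHAND = {
    "AMZN card": "Liabilities:CreditCard:Amazon",
    "prop mgmt fee": "Expenses:Home:PropertyManagement",
    "reimb": "Income:Reimbursements",
}

def normalize(query: str) -> str:
    # Longest-match-first so "prop mgmt fee" wins over any shorter overlap.
    for short, account in sorted(SHORTHAND.items(), key=lambda kv: -len(kv[0])):
        query = query.replace(short, account)
    return query

normalized = normalize("AMZN card Q3 reimb?")
```

<p>The embedding model then only ever sees canonical account names, which is exactly the vocabulary it can be fine-tuned or few-shot-primed on.</p>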
<p>The 20% compositional arithmetic accuracy even with oracle context is a separate signal: for Beancount calculation tasks, the generation bottleneck is reasoning, not retrieval. PAL-style offloading (generating Python arithmetic rather than free-text calculation) remains the right answer for numeric tasks regardless of how good retrieval gets.</p>
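<p>A hedged sketch of PAL-style offloading, where the "model output" below is a hand-written stand-in and the revenue figures are illustrative:</p>

```python
# PAL-style numeric offloading: instead of asking the LLM for a final
# number, ask it to emit a small Python program and execute that. The
# generated snippet here is a stand-in for actual model output.

def run_pal_program(program: str) -> float:
    # Execute model-generated arithmetic in an empty namespace and read
    # the conventional `answer` variable. Real deployments sandbox this.
    scope: dict = {}
    exec(program, scope)
    return scope["answer"]

# Example "model output" for a year-over-year revenue growth question
# (the figures are illustrative, not from any filing):
generated = """
revenue_2023 = 307_394
revenue_2022 = 282_836
answer = (revenue_2023 - revenue_2022) / revenue_2022
"""

growth = run_pal_program(generated)
```

<p>The division happens in the interpreter, not in the model's token stream, which is precisely where FinDER shows the 20% compositional ceiling bites.</p>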
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-to-read-next">What to read next<a href="https://beancount.io/bean-labs/research-logs/2026/06/28/finder-financial-dataset-rag-evaluation#what-to-read-next" class="hash-link" aria-label="Direct link to What to read next" title="Direct link to What to read next" translate="no">​</a></h2>
<ul>
<li class=""><strong>Fin-RATE</strong> (arXiv:2602.07294) — the companion benchmark for multi-period tracking on SEC filings; accuracy drops 18.60% on temporal tasks, which is the Beancount multi-year ledger problem stated directly.</li>
<li class=""><strong>IRCoT</strong> (arXiv:2212.10509, ACL 2023) — interleaving retrieval with chain-of-thought reasoning; the multi-pass retrieval structure directly addresses the low single-pass recall FinDER exposes.</li>
<li class=""><strong>Query expansion with LLMs for domain-specific retrieval</strong> — no single benchmark paper covers this well yet, but the FinDER abbreviation gap makes it a first-order research priority; searching for "HyDE financial domain" and "query expansion SEC filings 2025" is the right starting point.</li>
</ul>]]></content>
        <author>
            <name>Mike Thrift</name>
        </author>
        <category label="AI" term="AI"/>
        <category label="LLM" term="LLM"/>
        <category label="Machine Learning" term="Machine Learning"/>
        <category label="Finance" term="Finance"/>
        <category label="Beancount" term="Beancount"/>
        <category label="Data Science" term="Data Science"/>
        <category label="Financial Reporting" term="Financial Reporting"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Lost in the Middle: Position Bias in LLMs and Its Impact on Finance AI]]></title>
        <id>https://beancount.io/bean-labs/research-logs/2026/06/27/lost-in-the-middle-language-models-long-contexts</id>
        <link href="https://beancount.io/bean-labs/research-logs/2026/06/27/lost-in-the-middle-language-models-long-contexts"/>
        <updated>2026-06-27T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[The TACL 2024 paper by Liu et al. shows LLMs perform up to 20 points worse on information buried in the middle of long contexts — a U-shaped degradation affecting every tested model including Claude-1.3-100K — with concrete implications for how RAG pipelines should order retrieved passages in finance and accounting applications.]]></summary>
        <content type="html"><![CDATA[<p>When I look back at the DocFinQA entry — where retrieval-based pipelines and long-context LLMs both collapsed on SEC filings with 123K-token contexts — the question I left hanging was <em>why</em>. This paper by Liu et al. (TACL 2024, arXiv:2307.03172) is the mechanistic answer, and it turns out the failure mode is simpler and more stubborn than I would have expected.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-paper">The paper<a href="https://beancount.io/bean-labs/research-logs/2026/06/27/lost-in-the-middle-language-models-long-contexts#the-paper" class="hash-link" aria-label="Direct link to The paper" title="Direct link to The paper" translate="no">​</a></h2>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-beancount-io?title=Lost%20in%20the%20Middle%3A%20Position%20Bias%20in%20LLMs%20and%20Its%20Impact%20on%20Finance%20AI" alt="2026-06-27-lost-in-the-middle-language-models-long-contexts" class="img_ev3q"></p>
<p>"Lost in the Middle: How Language Models Use Long Contexts" by Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang runs two targeted experiments: multi-document question answering over NaturalQuestions-Open (with 10, 20, and 30 retrieved documents) and synthetic key-value retrieval (with 75, 140, and 300 pairs). In each experiment they systematically vary where the relevant document or key-value pair sits within the input context — beginning, middle, or end — while holding everything else fixed. The finding is clean: performance traces a U-shaped curve with the trough at the middle of the context, and the curve appears across every model tested.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-ideas">Key ideas<a href="https://beancount.io/bean-labs/research-logs/2026/06/27/lost-in-the-middle-language-models-long-contexts#key-ideas" class="hash-link" aria-label="Direct link to Key ideas" title="Direct link to Key ideas" translate="no">​</a></h2>
<ul>
<li class=""><strong>The U-shape is real and consistent.</strong> In the 20-document QA setting, performance at the first position was roughly 75% and degraded to around 55% at position 10 before recovering to about 72% at position 20 — a ~20-point gap between the edges and the center.</li>
<li class=""><strong>All the models follow the same pattern.</strong> The tested models span closed and open, small and large: GPT-3.5-Turbo (4K and 16K), GPT-4, Claude-1.3 (8K and 100K), MPT-30B-Instruct, and LongChat-13B. The U-curve showed up in every one of them, including models explicitly marketed for extended context windows.</li>
<li class=""><strong>Even Claude-1.3-100K isn't immune.</strong> The 100K-context variant behaved like the others. A long context window does not mean the model actually attends uniformly across it.</li>
<li class=""><strong>The closed-book baseline sets a sobering floor.</strong> GPT-3.5-Turbo without any documents answered 56.1% of NaturalQuestions correctly; with oracle access to just the one relevant document it hit 88.3%. But at the worst middle positions in the 20-document setting, performance dropped below the closed-book baseline — meaning adding more context was actively harmful.</li>
<li class=""><strong>Encoder-decoder models (Flan-T5-XXL, Flan-UL2) are more robust within their training length but revert when contexts exceed it.</strong> The architectural difference matters, but both still degrade at scale.</li>
<li class=""><strong>The root cause is causal attention masking.</strong> Each token can only attend to preceding tokens, so positions at the very beginning accumulate more total attention weight across the model than positions in the middle. Recency effects pull the end of context up as well.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-holds-up--and-what-doesnt">What holds up — and what doesn't<a href="https://beancount.io/bean-labs/research-logs/2026/06/27/lost-in-the-middle-language-models-long-contexts#what-holds-up--and-what-doesnt" class="hash-link" aria-label="Direct link to What holds up — and what doesn't" title="Direct link to What holds up — and what doesn't" translate="no">​</a></h2>
<p>The experimental design here is admirably clean: position is the only variable being manipulated, the tasks are standard benchmarks, and the finding replicates across a wide range of model families. I have no quarrel with the core result.</p>
<p>What I find less convincing is the framing of the key-value retrieval task as a meaningful proxy for real use. UUID-to-UUID lookups test whether a model can copy back a string it has just seen, not whether it can do anything requiring reasoning. The U-curve shows up there too, which strengthens the position-bias claim, but it also means the paper is conflating two different phenomena: retrieval accuracy on exact-match tasks and reasoning quality over relevant passages. I would want to know whether the U-shape gets worse or better when the relevant document requires multi-step inference before the final answer, not just verbatim regurgitation.</p>
<p>There is also a gap the authors mostly acknowledge but don't close: they never test whether instruction fine-tuning or RLHF changes the position sensitivity, only whether a larger context window does. Given that the root cause is architectural (causal masking), I suspect instruction tuning won't fix it, but the paper doesn't confirm this.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-this-matters-for-finance-ai">Why this matters for finance AI<a href="https://beancount.io/bean-labs/research-logs/2026/06/27/lost-in-the-middle-language-models-long-contexts#why-this-matters-for-finance-ai" class="hash-link" aria-label="Direct link to Why this matters for finance AI" title="Direct link to Why this matters for finance AI" translate="no">​</a></h2>
<p>This paper provides the mechanistic explanation for an empirical pattern I keep running into. DocFinQA collapsed on long SEC filings. IRCoT and FLARE both retrieve multiple passages and concatenate them before reasoning. Every RAG pipeline I've looked at in a finance context dumps retrieved passages sequentially into the prompt and hopes the model will attend to the right one.</p>
<p>The implication for Beancount agents is concrete. If an agent retrieves ten ledger entries as context, the entries in positions 3–7 are at highest risk of being ignored or hallucinated around. This is not a retrieval problem — it is a presentation problem. Two responses follow from this paper: either put the most diagnostically relevant entries first (and last), or don't concatenate at all and reason over one passage at a time.</p>
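<p>The first response can be mechanized in a few lines. A sketch of edge-first reordering (my own helper, not from the paper; the same trick ships in LangChain as <code>LongContextReorder</code>): take passages already sorted best-first by retrieval score and alternate them between the front and the back of the context, so the weakest land in the middle trough.</p>

```python
def litm_reorder(passages):
    """Reorder passages (best-first by retrieval score) so the strongest
    sit at the beginning and end of the context, pushing the weakest
    into the lost-in-the-middle trough.

    Alternates: best -> front, second -> back, third -> front, ...
    """
    front, back = [], []
    for i, p in enumerate(passages):
        (front if i % 2 == 0 else back).append(p)
    return front + back[::-1]

# Ranks 1..7 by retrieval score; after reordering, the weakest ranks (6, 7)
# sit mid-context while ranks 1 and 2 anchor the edges.
print(litm_reorder([1, 2, 3, 4, 5, 6, 7]))  # → [1, 3, 5, 7, 6, 4, 2]
```

<p>The design choice is deliberate: rather than trying to fix the model's attention profile, it fits the evidence to the profile the model already has.</p>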
<p>The finding also complicates the long-context-LLM narrative. Every quarter a new model announces a larger context window, but this paper shows that length alone means little if evidence is distributed uniformly across it. A 128K-context model that buries the relevant transaction at position 60K is worse than a 4K-context model that retrieves precisely the right passage.</p>
<p>For write-back safety, the implications are uncomfortable: if the model is asked to summarize a ledger session and the relevant "do not post this transaction" policy rule appears in the middle of a long system prompt, the model may act as though it never read that rule.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-to-read-next">What to read next<a href="https://beancount.io/bean-labs/research-logs/2026/06/27/lost-in-the-middle-language-models-long-contexts#what-to-read-next" class="hash-link" aria-label="Direct link to What to read next" title="Direct link to What to read next" translate="no">​</a></h2>
<ul>
<li class=""><strong>"Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding"</strong> (Zhang et al., arXiv:2403.04797) — proposes Multi-scale Positional Encoding (Ms-PoE) as a training-free fix via RoPE scaling; claims up to 3.8-point improvement on Zero-SCROLLS, directly addressing the U-curve.</li>
<li class=""><strong>"Never Lost in the Middle: Mastering Long-Context Question Answering with Position-Agnostic Decompositional Training"</strong> (arXiv:2311.09198) — takes the opposite approach and trains the model to be explicitly position-agnostic; the comparison with Ms-PoE clarifies whether fine-tuning or inference-time tricks are the better lever.</li>
<li class=""><strong>"Mitigate Position Bias in Large Language Models via Scaling a Single Dimension"</strong> (arXiv:2406.02536) — identifies the specific positional hidden states dimension responsible for the bias and scales it without retraining; the most surgical fix proposed so far, relevant to deploying existing models without retraining.</li>
</ul>]]></content>
        <author>
            <name>Mike Thrift</name>
        </author>
        <category label="LLM" term="LLM"/>
        <category label="AI" term="AI"/>
        <category label="Machine Learning" term="Machine Learning"/>
        <category label="Data Science" term="Data Science"/>
        <category label="Finance" term="Finance"/>
        <category label="Technology" term="Technology"/>
        <category label="Analytics" term="Analytics"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[AD-LLM Benchmark: GPT-4o Hits 0.93+ AUROC Zero-Shot for Text Anomaly Detection]]></title>
        <id>https://beancount.io/bean-labs/research-logs/2026/06/26/ad-llm-benchmarking-llms-anomaly-detection</id>
        <link href="https://beancount.io/bean-labs/research-logs/2026/06/26/ad-llm-benchmarking-llms-anomaly-detection"/>
        <updated>2026-06-26T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[AD-LLM benchmarks GPT-4o and Llama 3.1 8B across three anomaly detection roles — zero-shot detector, data augmenter, and model selector — on five NLP datasets; GPT-4o reaches AUROC 0.93–0.99 zero-shot, but LLM-based model selection remains unreliable, with direct implications for financial audit AI.]]></summary>
        <content type="html"><![CDATA[<p>The last two entries in this series covered AnoLLM and CausalTAD — fine-tuned and prompt-engineered approaches to tabular anomaly detection. Before deploying either at production scale, you need to know where LLMs actually stand across a broader range of anomaly detection paradigms. That is the explicit goal of AD-LLM, which benchmarks LLMs across three distinct roles: zero-shot detector, data augmentation engine, and model selection advisor. The focus is NLP text data rather than tabular ledger entries, but the methodological lessons transfer.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-paper">The paper<a href="https://beancount.io/bean-labs/research-logs/2026/06/26/ad-llm-benchmarking-llms-anomaly-detection#the-paper" class="hash-link" aria-label="Direct link to The paper" title="Direct link to The paper" translate="no">​</a></h2>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-beancount-io?title=AD-LLM%20Benchmark%3A%20GPT-4o%20Hits%200.93%2B%20AUROC%20Zero-Shot%20for%20Text%20Anomaly%20Detection" alt="2026-06-26-ad-llm-benchmarking-llms-anomaly-detection" class="img_ev3q"></p>
<p>Tiankai Yang, Yi Nian, and colleagues at USC and Texas A&amp;M introduce AD-LLM (arXiv:2412.11142, ACL Findings 2025), the first benchmark to evaluate LLMs systematically across three anomaly detection paradigms on NLP datasets. The setting is one-class classification: training data contains only normal samples, and the model must flag anomalies at test time. The five datasets — AG News, BBC News, IMDB Reviews, N24 News, and SMS Spam — all derive from text classification tasks with one category designated as anomalous. The paper pits two LLMs, GPT-4o and Llama 3.1 8B Instruct, against 18 traditional unsupervised baselines that span end-to-end methods (CVDD, DATE) and two-step embedding-plus-detector combinations (OpenAI embeddings + LUNAR, LOF, Isolation Forest, etc.).</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-ideas">Key ideas<a href="https://beancount.io/bean-labs/research-logs/2026/06/26/ad-llm-benchmarking-llms-anomaly-detection#key-ideas" class="hash-link" aria-label="Direct link to Key ideas" title="Direct link to Key ideas" translate="no">​</a></h2>
<ul>
<li class=""><strong>Zero-shot detection works well for text.</strong> GPT-4o scores AUROC of 0.9293–0.9919 across the five datasets in the Normal+Anomaly setting; Llama 3.1 reaches 0.8612–0.9487. The best traditional baseline, OpenAI + LUNAR, scores around 0.92 on AG News — GPT-4o matches or beats it without any training.</li>
<li class=""><strong>Synthetic augmentation helps, consistently but modestly.</strong> LLM-generated synthetic samples improve the OpenAI + LUNAR pipeline on all five datasets. Category description augmentation also improves most baselines, though gains are uneven — Llama 3.1 improves AUROC by +0.07 on IMDB Reviews, but results elsewhere are smaller.</li>
<li class=""><strong>Model selection is the weak link.</strong> GPT-o1-preview recommends models that surpass the average baseline performance on most datasets, and occasionally approaches the best method (e.g., on IMDB Reviews and SMS Spam). But it never reliably identifies the top performer, and the authors acknowledge the recommendations are based on simplistic inputs that lack dataset-specific statistics.</li>
<li class=""><strong>Open-source versus proprietary gap is real.</strong> GPT-4o's AUROC advantage over Llama 3.1 8B is 4–13 points depending on dataset, a gap consistent with the pattern seen in zero-shot tabular anomaly detection papers.</li>
<li class=""><strong>NLP anomaly detection still lacks a definitive benchmark.</strong> Five datasets, all derived from classification corpora, is thin. The companion NLP-ADBench paper (EMNLP Findings 2025) broadens to eight datasets and 19 algorithms but still uses the same semantic-category-as-anomaly construction that makes these tasks somewhat artificial.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-holds-up--and-what-doesnt">What holds up — and what doesn't<a href="https://beancount.io/bean-labs/research-logs/2026/06/26/ad-llm-benchmarking-llms-anomaly-detection#what-holds-up--and-what-doesnt" class="hash-link" aria-label="Direct link to What holds up — and what doesn't" title="Direct link to What holds up — and what doesn't" translate="no">​</a></h2>
<p>The zero-shot findings are credible. Using LLMs as scorers without fine-tuning on labelled anomaly data is genuinely useful when the anomaly class is semantically coherent — a spam message differs from a ham message in ways a well-trained language model understands. The AUROC numbers are high, and the comparison against strong OpenAI-embedding-based baselines is fair.</p>
<p>The scope, though, is narrow in ways the paper undersells. All five datasets encode anomalies as a different <em>topic category</em> — spam versus legitimate SMS, news from a held-out publisher versus in-distribution outlets. This means the LLM is essentially doing topic classification, a task it is explicitly pre-trained on. The benchmark does not include semantic anomalies within a single category (e.g., unusual transactions within the same account type), which is precisely the kind of anomaly that matters for financial auditing.</p>
<p>The data augmentation and model selection tasks are evaluated on the same five datasets, so the paper ends up benchmarking whether LLMs can make slightly different slices of the same narrow problem marginally better. The authors freely list six limitations — including that they test only a subset of LLMs, exclude few-shot and fine-tuning regimes, and rely on simplistic inputs for model selection — which is intellectually honest but also flags how preliminary this benchmark is.</p>
<p>One result worth flagging for skeptics: the AUPRC scores are substantially lower than AUROC for both models. Llama 3.1 on BBC News reaches AUROC 0.8612 but only AUPRC 0.3960, reflecting the class imbalance in the one-class setup. In high-precision auditing contexts, AUPRC is the more meaningful metric, and here the picture is less flattering.</p>
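<p>That AUROC-versus-AUPRC gap is easy to reproduce with synthetic scores. A self-contained sketch (toy data, not the paper's): a detector that ranks 20 anomalies against 1,000 normals quite well still posts a modest average precision, because at 2% prevalence every false positive ranked ahead of a true positive is expensive.</p>

```python
import random

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative
    (ties count half). Pure-Python AUROC."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auprc(scores, labels):
    """Average precision: mean of precision@k taken at the rank of each
    positive in the score-descending ordering."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    tp, precisions = 0, []
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            tp += 1
            precisions.append(tp / k)
    return sum(precisions) / len(precisions)

# 1,000 normals, 20 anomalies; the detector separates them well but imperfectly.
rng = random.Random(0)
labels = [0] * 1000 + [1] * 20
scores = [rng.gauss(0, 1) for _ in range(1000)] + [rng.gauss(2.5, 1) for _ in range(20)]
print(round(auroc(scores, labels), 3), round(auprc(scores, labels), 3))
```

<p>The AUROC lands above 0.9 while the AUPRC lands far below it, which is exactly the shape of the Llama 3.1 BBC News result.</p>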
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-this-matters-for-finance-ai">Why this matters for finance AI<a href="https://beancount.io/bean-labs/research-logs/2026/06/26/ad-llm-benchmarking-llms-anomaly-detection#why-this-matters-for-finance-ai" class="hash-link" aria-label="Direct link to Why this matters for finance AI" title="Direct link to Why this matters for finance AI" translate="no">​</a></h2>
<p>The Bean Labs agenda involves two anomaly detection use cases: catching unusual ledger entries in real time (tabular, structured) and flagging suspicious narrative text in invoices, memos, or support tickets (unstructured NLP). AD-LLM speaks directly to the second case and gives us a realistic ceiling: GPT-4o can zero-shot detect topic-level anomalies in text with AUROC above 0.93 on clean, balanced datasets. That is a useful prior, but ledger narrative anomalies are subtler — an invoice memo that describes a routine service but belongs to a vendor flagged for suspicious patterns is not a topic-classification problem. The benchmark provides a starting point, not an answer.</p>
<p>The model-selection finding is separately interesting for system design. The dream of asking an LLM "which anomaly detector should I use on this dataset?" and getting a reliable answer does not yet pan out. That means choosing between AnoLLM-style fine-tuning, CausalTAD-style causal prompting, or a classical embedding method still requires human judgment or systematic empirical evaluation — it cannot be delegated to an LLM advisor.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-to-read-next">What to read next<a href="https://beancount.io/bean-labs/research-logs/2026/06/26/ad-llm-benchmarking-llms-anomaly-detection#what-to-read-next" class="hash-link" aria-label="Direct link to What to read next" title="Direct link to What to read next" translate="no">​</a></h2>
<ul>
<li class=""><strong>NLP-ADBench</strong> (arXiv:2412.04784, EMNLP Findings 2025) — the companion benchmark from the same group, covering eight datasets and 19 algorithms; provides the broader classical baseline context that AD-LLM's five-dataset scope cannot.</li>
<li class=""><strong>Large Language Models for Anomaly and Out-of-Distribution Detection: A Survey</strong> (arXiv:2409.01980, NAACL Findings 2025) — surveys the full landscape of LLM-based AD approaches across text, image, and tabular modalities; fills in the context around where AD-LLM sits relative to prior work.</li>
<li class=""><strong>AnoLLM: Large Language Models for Tabular Anomaly Detection</strong> (OpenReview:7VkHffT5X2, ICLR 2025) — the tabular counterpart; comparing its likelihood-based approach to AD-LLM's prompt-based zero-shot strategy clarifies which paradigm is more appropriate for Beancount ledger entries.</li>
</ul>]]></content>
        <author>
            <name>Mike Thrift</name>
        </author>
        <category label="LLM" term="LLM"/>
        <category label="AI" term="AI"/>
        <category label="Machine Learning" term="Machine Learning"/>
        <category label="Data Science" term="Data Science"/>
        <category label="Fraud Detection" term="Fraud Detection"/>
        <category label="Analytics" term="Analytics"/>
        <category label="Anomaly Detection" term="Anomaly Detection"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[CausalTAD: Causal Column Ordering for LLM Tabular Anomaly Detection]]></title>
        <id>https://beancount.io/bean-labs/research-logs/2026/06/25/causaltad-causal-knowledge-llm-tabular-anomaly-detection</id>
        <link href="https://beancount.io/bean-labs/research-logs/2026/06/25/causaltad-causal-knowledge-llm-tabular-anomaly-detection"/>
        <updated>2026-06-25T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[CausalTAD improves LLM-based tabular anomaly detection by reordering table columns to respect causal dependencies before serialization, lifting average AUC-ROC from 0.803 to 0.834 over AnoLLM on mixed-type benchmarks — with direct implications for detecting anomalies in structured ledger data.]]></summary>
        <content type="html"><![CDATA[<p>The previous log covered AnoLLM, which fine-tunes a small LLM to score tabular anomalies via negative log-likelihood. CausalTAD (arXiv:2602.07798) asks a sharp follow-up question: does the order in which you feed columns to that LLM matter? The answer, it turns out, is yes — and injecting causal structure into the ordering gives you a consistent, reproducible lift.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-paper">The paper<a href="https://beancount.io/bean-labs/research-logs/2026/06/25/causaltad-causal-knowledge-llm-tabular-anomaly-detection#the-paper" class="hash-link" aria-label="Direct link to The paper" title="Direct link to The paper" translate="no">​</a></h2>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-beancount-io?title=CausalTAD%3A%20Causal%20Column%20Ordering%20for%20LLM%20Tabular%20Anomaly%20Detection" alt="2026-06-25-causaltad-causal-knowledge-llm-tabular-anomaly-detection" class="img_ev3q"></p>
<p>Wang et al. propose CausalTAD, a method that sits on top of AnoLLM-style LLM anomaly detectors and makes one targeted change: instead of serializing tabular rows in random or arbitrary column order, it discovers causal dependencies between columns and reorders them to respect those dependencies before the LLM reads the row.</p>
<p>The paper has two moving parts. First, a causal-driven column ordering module. The authors adapt the COAT factor-extraction framework: an LLM reads column metadata and samples to extract high-level semantic factors (for credit card transactions, a factor like "Compensation" might span the amount and merchant columns). From these factors, three causal discovery algorithms — PC, LiNGAM, and FCI — each build a directed causal graph over factors. The column reordering problem then becomes a Linear Ordering Problem: find the permutation π that maximizes the sum of directed-edge weights, so that cause columns appear before effect columns in the serialized text. Because the LOP has many near-optimal solutions, they sample K ≈ 10 orderings within 90% of the optimum and average over them.</p>
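<p>The ordering step is a Linear Ordering Problem, NP-hard in general but trivial at the scale of a factor graph over a handful of columns. A brute-force sketch (the column names and edge weights below are hypothetical, not the paper's):</p>

```python
from itertools import permutations

def lop_orderings(columns, edges, keep_frac=0.9):
    """Brute-force the Linear Ordering Problem for a small column set:
    score each permutation by the total weight of causal edges
    (cause -> effect) it respects (cause placed before effect), and keep
    every ordering scoring within keep_frac of the optimum."""
    def score(order):
        pos = {c: i for i, c in enumerate(order)}
        return sum(w for (u, v), w in edges.items() if pos[u] < pos[v])

    scored = [(score(p), p) for p in permutations(columns)]
    best = max(s for s, _ in scored)
    return [p for s, p in scored if s >= keep_frac * best]

# Hypothetical factor-level causal graph for a transactions table:
edges = {("amount", "account"): 1.0,
         ("account", "counterparty"): 1.0,
         ("amount", "memo"): 0.5}
orderings = lop_orderings(["amount", "account", "counterparty", "memo"], edges)
```

<p>At realistic column counts the exhaustive search would be swapped for an LOP heuristic, but the scoring objective stays the same.</p>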
<p>Second, a causal-aware reweighting module. Not all columns are equally relevant. A column that influences many factors gets a higher weight αj = |M⁻¹(cj)|, the count of factors it contributes to. The final anomaly score is the weighted average of per-column negative log-likelihoods across the K orderings.</p>
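<p>The reweighting step then reduces to a weighted average. A sketch of the final score under my reading of the paper (the normalization constant is my assumption; the paper may normalize differently):</p>

```python
def causal_anomaly_score(per_column_nll_by_ordering, factor_counts):
    """CausalTAD-style final score: weight each column's NLL by the number
    of factors it contributes to, then average over the K sampled
    near-optimal orderings.

    per_column_nll_by_ordering: list (one entry per ordering) of
        {column: nll} dicts
    factor_counts: {column: number of factors the column maps to}
    """
    total_weight = sum(factor_counts.values())
    scores = []
    for nlls in per_column_nll_by_ordering:
        scores.append(sum(factor_counts[c] * nll for c, nll in nlls.items())
                      / total_weight)
    return sum(scores) / len(scores)

# Two near-optimal orderings of a toy row; "amount" touches 2 factors, "memo" 1.
nlls_by_ordering = [{"amount": 1.2, "memo": 3.0}, {"amount": 1.4, "memo": 2.8}]
score = causal_anomaly_score(nlls_by_ordering, {"amount": 2, "memo": 1})
```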
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-ideas">Key ideas<a href="https://beancount.io/bean-labs/research-logs/2026/06/25/causaltad-causal-knowledge-llm-tabular-anomaly-detection#key-ideas" class="hash-link" aria-label="Direct link to Key ideas" title="Direct link to Key ideas" translate="no">​</a></h2>
<ul>
<li class="">Column ordering is a non-trivial inductive bias for autoregressive LLMs: placing a cause column before its effect column lets the model condition on the correct context when assigning likelihood to the effect.</li>
<li class="">Causal discovery at the factor level (rather than the raw column level) lets the method handle mixed-type tables where direct causal discovery between heterogeneous columns is noisy.</li>
<li class="">On 6 mixed-type benchmark datasets, CausalTAD with SmolLM-135M reaches average AUC-ROC 0.834 vs AnoLLM's 0.803 — a 3.1-point absolute improvement with the same backbone model.</li>
<li class="">On the Fake Job Posts dataset specifically, CausalTAD scores 0.873 vs AnoLLM's 0.800 — a 9.1% relative gain, which is large enough to matter in a real triage system.</li>
<li class="">Across 30 numerical ODDS benchmark datasets, CausalTAD achieves the best average AUC-ROC, consistently outperforming classical baselines (Isolation Forest, ECOD, KNN) and deep methods (DeepSVDD, SLAD).</li>
<li class="">All three causal discovery algorithms beat random ordering in ablation; LiNGAM edges out PC and FCI slightly on the mixed datasets.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-holds-up--and-what-doesnt">What holds up — and what doesn't<a href="https://beancount.io/bean-labs/research-logs/2026/06/25/causaltad-causal-knowledge-llm-tabular-anomaly-detection#what-holds-up--and-what-doesnt" class="hash-link" aria-label="Direct link to What holds up — and what doesn't" title="Direct link to What holds up — and what doesn't" translate="no">​</a></h2>
<p>The core claim — that causal column order helps — is well-supported. The ablation is clean: swapping random ordering for any of the three causal discovery methods improves results on the Fake Job Posts benchmark (from 0.832 to 0.870–0.873), and factor-count reweighting further helps in every configuration. That's a credible story.</p>
<p>What I find less convincing is the bootstrapping assumption. The causal graph is constructed by using an LLM to extract semantic factors from the very data the system is meant to analyze. If the LLM misunderstands the domain — say, for a bespoke accounting system with non-standard column names — the factor extraction will be wrong, and a bad causal graph is arguably worse than random ordering because it introduces a systematic bias. The authors acknowledge this risk ("relies on the capability of LLMs for factor extraction") but do not benchmark factor extraction accuracy independently.</p>
<p>There's also a computational overhead issue that is more serious than the paper suggests. Running three causal discovery algorithms, solving an LP, sampling K orderings, and then running inference on K serialized versions of every test point multiplies the inference cost by K. For a ledger with millions of entries, this matters. The paper notes "future work may focus on improving efficiency" but offers no concrete profiling.</p>
<p>Finally, the 30 numerical ODDS datasets are well-studied and arguably saturated for methods like this. The more meaningful signal is in the 6 mixed-type datasets — which are the realistic ones for finance — and the improvements there, while real, are somewhat modest in absolute terms.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-this-matters-for-finance-ai">Why this matters for finance AI<a href="https://beancount.io/bean-labs/research-logs/2026/06/25/causaltad-causal-knowledge-llm-tabular-anomaly-detection#why-this-matters-for-finance-ai" class="hash-link" aria-label="Direct link to Why this matters for finance AI" title="Direct link to Why this matters for finance AI" translate="no">​</a></h2>
<p>Beancount transactions have genuine causal structure: the posting amount causally drives the account selection, the account drives the counterparty expectation, and the memo text is causally downstream of all three. Random column serialization ignores this, which means an AnoLLM-style model is just as likely to see "memo: groceries | account: Expenses:Food | amount: $4200" as the correctly ordered version.</p>
<p>CausalTAD gives a principled way to encode "amount and account come first" without hardcoding it as a rule. For Bean Labs audit agents, this suggests a practical architectural choice: before scoring a batch of transactions for anomalies, spend one pass discovering the causal graph over the ledger's column schema, then use that fixed ordering for all subsequent inference. The overhead is paid once at schema-level, not per-transaction.</p>
<p>The credit card fraud detection example in the paper is essentially the same task structure as ledger anomaly detection: heterogeneous features, rare labels, and a causal order that domain experts know intuitively but that LLMs would otherwise ignore.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-to-read-next">What to read next<a href="https://beancount.io/bean-labs/research-logs/2026/06/25/causaltad-causal-knowledge-llm-tabular-anomaly-detection#what-to-read-next" class="hash-link" aria-label="Direct link to What to read next" title="Direct link to What to read next" translate="no">​</a></h2>
<ul>
<li class=""><strong>AD-LLM: Benchmarking Large Language Models for Anomaly Detection</strong> (arXiv:2412.11142, ACL Findings 2025) — the systematic benchmark across three LLM anomaly detection paradigms that CausalTAD fits into; reading it gives the full landscape rather than the single AnoLLM vs CausalTAD comparison.</li>
<li class=""><strong>COAT: Boosting Large Language Model-Based In-Context Learning for Tabular Data</strong> (Liu et al., 2024) — the factor-extraction framework CausalTAD adapts; understanding how it works clarifies where the causal graph quality can fail.</li>
<li class=""><strong>Causal discovery in heterogeneous data: a survey</strong> — for understanding the relative merits of PC vs LiNGAM vs FCI on mixed-type tabular data, since the paper treats all three as interchangeable but they make different independence assumptions.</li>
</ul>]]></content>
        <author>
            <name>Mike Thrift</name>
        </author>
        <category label="LLM" term="LLM"/>
        <category label="AI" term="AI"/>
        <category label="Machine Learning" term="Machine Learning"/>
        <category label="Fraud Detection" term="Fraud Detection"/>
        <category label="Data Science" term="Data Science"/>
        <category label="Anomaly Detection" term="Anomaly Detection"/>
        <category label="Beancount" term="Beancount"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[AnoLLM: Fine-Tuning LLMs for Tabular Anomaly Detection in Financial Data]]></title>
        <id>https://beancount.io/bean-labs/research-logs/2026/06/24/anollm-llm-fine-tuning-tabular-anomaly-detection</id>
        <link href="https://beancount.io/bean-labs/research-logs/2026/06/24/anollm-llm-fine-tuning-tabular-anomaly-detection"/>
        <updated>2026-06-24T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[AnoLLM (ICLR 2025) reformulates tabular anomaly detection as LLM density estimation — fine-tuning on normal rows and scoring by negative log-likelihood. It outperforms classical methods on mixed-type fraud datasets but offers no edge on purely numerical data, with real implications for detecting anomalies in Beancount ledger entries.]]></summary>
        <content type="html"><![CDATA[<p>The zero-shot LLM anomaly detection paper I read two days ago (arXiv:2406.16308) showed that GPT-4 could identify tabular outliers without any training, matching classical baselines like ECOD on the ODDS benchmark. But it had an obvious weakness: asking the model to output a list of anomalous row indices is fragile — open-source models routinely hallucinate indices, go out of bounds, or flag every row as suspicious. AnoLLM, published at ICLR 2025 by Che-Ping Tsai, Ganyu Teng, Phillip Wallis, and Wei Ding from Amazon, fixes that fragility while also pushing into mixed-type datasets where pure numerical baselines start to struggle.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-paper">The paper<a href="https://beancount.io/bean-labs/research-logs/2026/06/24/anollm-llm-fine-tuning-tabular-anomaly-detection#the-paper" class="hash-link" aria-label="Direct link to The paper" title="Direct link to The paper" translate="no">​</a></h2>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-beancount-io?title=AnoLLM%3A%20Fine-Tuning%20LLMs%20for%20Tabular%20Anomaly%20Detection%20in%20Financial%20Data" alt="2026-06-24-anollm-llm-fine-tuning-tabular-anomaly-detection" class="img_ev3q"></p>
<p>AnoLLM reframes tabular anomaly detection as language model density estimation rather than prompted classification. Instead of asking the LLM to name which rows look suspicious, the authors fine-tune a pre-trained language model on serialized in-distribution (normal) training rows, then score each test row by its negative log-likelihood under that learned distribution. A row that looks nothing like the training distribution gets a high NLL — that is the anomaly score. No index format, no output parsing, no fragile regex extraction.</p>
<p>The serialization converts each table row to a natural-language string with feature names and values. For text-valued columns the NLL is normalized per column to avoid length bias, where longer descriptions would otherwise mechanically accumulate higher probability costs. For numerical and categorical columns, raw token-level NLL is summed across the field. The model is fine-tuned in a semi-supervised setting — only normal-labelled rows enter training — for up to 2,000 steps using distributed GPU training.</p>
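<p>Both the serialization and the per-column normalization are simple to sketch. A toy version under stated assumptions: the "col is value" template is one plausible serialization, and the per-token NLLs here are supplied directly rather than computed by the fine-tuned LM.</p>

```python
def serialize_row(row: dict) -> str:
    """One plausible AnoLLM-style serialization: 'col is value' pairs
    joined into a single string the LM can score autoregressively."""
    return ", ".join(f"{k} is {v}" for k, v in row.items())

def row_nll(token_nlls, text_columns):
    """Anomaly score for a serialized row given per-token NLLs grouped by
    column (token_nlls: {column: [nll, ...]}). Text columns are averaged
    over their token count so long memos do not dominate the score;
    numerical/categorical columns are summed."""
    score = 0.0
    for col, nlls in token_nlls.items():
        score += sum(nlls) / len(nlls) if col in text_columns else sum(nlls)
    return score

row = {"date": "2026-06-24", "account": "Expenses:Food", "amount": "4200.00",
       "memo": "weekly grocery run at the usual store"}
serialized = serialize_row(row)
# Toy per-token NLLs: the long memo is averaged, the amount tokens are summed,
# so a surprising amount (high NLL) outweighs a merely verbose memo.
score = row_nll({"memo": [0.9, 1.1, 1.0, 1.0], "amount": [4.5, 3.5]},
                text_columns={"memo"})
```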
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-ideas">Key ideas<a href="https://beancount.io/bean-labs/research-logs/2026/06/24/anollm-llm-fine-tuning-tabular-anomaly-detection#key-ideas" class="hash-link" aria-label="Direct link to Key ideas" title="Direct link to Key ideas" translate="no">​</a></h2>
<ul>
<li class="">The output format problem: prior index-prediction approaches require LLMs to reliably output anomalous row indices from a batch. Llama-family models frequently pair values with the wrong indices, generate indices beyond the batch size, or simply flag every row as anomalous. NLL scoring bypasses this entirely.</li>
<li class="">AnoLLM achieves the best performance on six benchmark datasets with mixed feature types, including vehicle insurance fraud detection and e-commerce fraud datasets from Kaggle.</li>
<li class="">On the 30 predominantly numerical ODDS benchmark datasets, AnoLLM performs on par with top classical baselines — not clearly better, just competitive.</li>
<li class="">The NLL-per-column normalization for text features is a small but load-bearing engineering decision: without it, a transaction description with thirty tokens would dominate the score over a two-digit amount, which is the wrong inductive bias.</li>
<li class="">Baseline context: the zero-shot GPT-4 approach (arXiv:2406.16308) achieves an average AUROC of 74.1 on ODDS, comparable to ECOD (75.5) and KNN (70.7). AnoLLM's advantage shows up specifically on datasets where text and categorical features carry meaningful anomaly signal.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-holds-up--and-what-doesnt">What holds up — and what doesn't<a href="https://beancount.io/bean-labs/research-logs/2026/06/24/anollm-llm-fine-tuning-tabular-anomaly-detection#what-holds-up--and-what-doesnt" class="hash-link" aria-label="Direct link to What holds up — and what doesn't" title="Direct link to What holds up — and what doesn't" translate="no">​</a></h2>
<p>The core NLL idea is sound. Using a fine-tuned language model as a density estimator over serialized rows is principled, and it naturally handles the joint distribution of all columns simultaneously — something that classical unsupervised detectors applied column-by-column cannot do cleanly. The fix to index prediction is genuinely useful and the comparison to the zero-shot baseline is fair.</p>
<p>What bothers me is the cost-benefit gap that the paper underreports. AnoLLM requires fine-tuning and serving an LLM for inference — a substantial infrastructure commitment compared to fitting ECOD or IsolationForest on a CPU in seconds. On the ODDS benchmark (purely numerical), AnoLLM is only "on par," not better. So the case for AnoLLM is entirely in the mixed-type regime, where the six evaluated datasets are from fraud detection on Kaggle. Six datasets is a thin empirical foundation for a strong recommendation, especially since benchmark datasets from Kaggle tend to have clean schemas, fixed column semantics, and known ground truth — all things that production ledger data often lacks.</p>
<p>The column ordering problem is also left open. CausalTAD (arXiv:2602.07798) immediately identified this gap: AnoLLM serializes columns in arbitrary order, ignoring the causal relationships between fields. For structured data with known causal chains — account type influences valid transaction ranges, which influence expected counterparty — this is a real limitation. CausalTAD frames reordering as a linear ordering problem and reports consistent improvement over AnoLLM across 30+ datasets. That the gap existed and was findable so quickly suggests AnoLLM's serialization design wasn't fully thought through.</p>
<p>There is also a scale question the paper doesn't address: at what volume of normal training examples does fine-tuning an LLM become worth it over, say, a tabular deep learning model trained directly on the numerical features? For personal Beancount ledgers with a few thousand entries, the compute cost may easily dwarf any accuracy gain.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-this-matters-for-finance-ai">Why this matters for finance AI<a href="https://beancount.io/bean-labs/research-logs/2026/06/24/anollm-llm-fine-tuning-tabular-anomaly-detection#why-this-matters-for-finance-ai" class="hash-link" aria-label="Direct link to Why this matters for finance AI" title="Direct link to Why this matters for finance AI" translate="no">​</a></h2>
<p>Beancount ledger entries are exactly the kind of mixed-type data AnoLLM targets: amounts (numerical), account names (structured text), payee/narration (free text), tags (categorical), dates (structured). A single entry like <code>2024-03-15 * "AWS" "Cloud invoice" Assets:Checking -2400.00 USD</code> encodes information across all these types simultaneously. Classical anomaly detectors struggle here because they need separate handling for each column type, and they lose the correlations between them — the joint pattern that "AWS" invoices should be in a certain range and hit a specific account.</p>
<p>AnoLLM's NLL approach would, in principle, learn these joint patterns from normal historical entries and flag deviations across any column combination. That is potentially more useful than rule-based journal entry tests (JETs) or single-column statistical tests.</p>
<p>That said, the double-entry accounting constraint is structural knowledge AnoLLM cannot learn from serialized rows alone — debits must equal credits, account hierarchies must be respected. These domain invariants are hard constraints, not statistical regularities, and no amount of LLM fine-tuning on historical rows will enforce them reliably if the training data contains any exceptions or rounding artifacts. The right architecture probably combines AnoLLM's NLL scoring for semantic anomalies with explicit rule checks for structural ones.</p>
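<p>The structural half of that hybrid is trivial to state as code. A minimal sketch (my own illustration, not from the paper): enforce the zero-sum invariant per transaction as a hard rule, independent of any learned score:</p>

```python
from decimal import Decimal

def postings_balance(amounts: list) -> bool:
    """Hard double-entry check: posting amounts within a transaction must
    sum to exactly zero. This is a rule, not a statistical regularity, so
    it is enforced directly rather than learned from historical rows."""
    return sum((Decimal(a) for a in amounts), Decimal("0")) == 0

# A learned NLL score flags *unusual* entries; this check flags *invalid* ones.
```

<p>Using <code>Decimal</code> rather than floats matters here: the invariant is exact, and float rounding would reintroduce the very tolerance that makes learned models unreliable on this constraint.</p>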
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-to-read-next">What to read next<a href="https://beancount.io/bean-labs/research-logs/2026/06/24/anollm-llm-fine-tuning-tabular-anomaly-detection#what-to-read-next" class="hash-link" aria-label="Direct link to What to read next" title="Direct link to What to read next" translate="no">​</a></h2>
<ul>
<li class="">CausalTAD (arXiv:2602.07798) — directly improves on AnoLLM by injecting causal column ordering; the most immediate follow-up to evaluate</li>
<li class="">AD-LLM: Benchmarking Large Language Models for Anomaly Detection (arXiv:2412.11142, ACL Findings 2025) — provides the systematic multi-paradigm evaluation missing from individual method papers</li>
<li class="">"Language Models are Realistic Tabular Data Generators" (Borisov et al., arXiv:2210.06280, ICLR 2023) — the BE-GREAT model that AnoLLM uses as a baseline; understanding it clarifies what AnoLLM actually improves over beyond index-prediction</li>
</ul>]]></content>
        <author>
            <name>Mike Thrift</name>
        </author>
        <category label="AI" term="AI"/>
        <category label="LLM" term="LLM"/>
        <category label="Machine Learning" term="Machine Learning"/>
        <category label="Fraud Detection" term="Fraud Detection"/>
        <category label="Data Science" term="Data Science"/>
        <category label="Beancount" term="Beancount"/>
        <category label="Finance" term="Finance"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[LLMs Score 2.3% on Beancount DSL Generation: The LLMFinLiteracy Benchmark]]></title>
        <id>https://beancount.io/bean-labs/research-logs/2026/06/23/llm-beancount-dsl-financial-literacy-benchmark</id>
        <link href="https://beancount.io/bean-labs/research-logs/2026/06/23/llm-beancount-dsl-financial-literacy-benchmark"/>
        <updated>2026-06-23T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[The LLMFinLiteracy benchmark finds that five open-weight ~7B models generate fully correct Beancount transactions only 2.3% of the time, with failures concentrated in accounting reasoning—not syntax—pointing to compiler-in-the-loop feedback as the critical missing ingredient for reliable write-back agents.]]></summary>
        <content type="html"><![CDATA[<p>This is the paper I have been waiting for since LOG-001: a direct empirical test of whether LLMs can generate valid Beancount DSL transactions from natural language financial scenarios. Figueroa et al. from Berlin University of Applied Sciences present what they claim — correctly, as far as I can tell — to be the first published evaluation of LLMs on financial transaction generation in plain-text accounting. The short answer is: they cannot, at least not reliably, even with chain-of-thought prompting and the actual Beancount balance sheet handed to them as context.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-paper">The paper<a href="https://beancount.io/bean-labs/research-logs/2026/06/23/llm-beancount-dsl-financial-literacy-benchmark#the-paper" class="hash-link" aria-label="Direct link to The paper" title="Direct link to The paper" translate="no">​</a></h2>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-beancount-io?title=LLMs%20Score%202.3%25%20on%20Beancount%20DSL%20Generation%3A%20The%20LLMFinLiteracy%20Benchmark" alt="2026-06-23-llm-beancount-dsl-financial-literacy-benchmark" class="img_ev3q"></p>
<p>Figueroa, Grundmann, Freidank, Löser, and Nejdl evaluate five open-weight ~7B models on a two-task benchmark they call LLMFinLiteracy. Task 1 asks models to generate textual scenarios that would affect a given liquidity ratio (current, quick, or cash ratio) given a real quarterly balance sheet from one of five DAX-listed companies (Airbus, Bayer, Deutsche Telekom, Mercedes-Benz, SAP). Task 2 asks models to translate those scenarios into compilable Beancount transactions. The Beancount compiler serves as the ground-truth syntax checker; human domain experts evaluate semantic correctness. The paper introduces a 12-class error taxonomy across the two tasks and uses a 9-step chain-of-thought prompt that includes double-entry accounting rules, an input/output example, and the real company balance sheet in Beancount format. The models evaluated — Llama-3-8B, Qwen-2-7B, Mistral-7B, CodeLlama-7B, and CodeQwen-1.5-7B — were all run on-premise due to financial data sensitivity. The corpus totals 1,500 generated samples, with 300 stratified entries evaluated by human experts.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-ideas">Key ideas<a href="https://beancount.io/bean-labs/research-logs/2026/06/23/llm-beancount-dsl-financial-literacy-benchmark#key-ideas" class="hash-link" aria-label="Direct link to Key ideas" title="Direct link to Key ideas" translate="no">​</a></h2>
<ul>
<li class="">Only 7 of 300 evaluated scenario-transaction pairs (2.3%) were fully correct end-to-end; even restricting to the three general-purpose models raises the rate only to 3.8%.</li>
<li class="">The two best models, Qwen-2-7B and Mistral-7B, produce correct scenarios only 21.67% and 20.00% of the time, and correct compiling transactions only 16.67% and 10.00% of the time.</li>
<li class="">Code-specialized models (CodeLlama, CodeQwen) score 0% on both tasks; they responded to the prompt template with a literal "Processed — Waiting for next input" string, completely ignoring the task.</li>
<li class="">Syntax is not the bottleneck: no model produced a single syntax error. The failures are entirely in accounting <em>reasoning</em> — balance errors dominate for Qwen-2 (61.67%) and Llama-3 (38.33%), while Mistral mostly references accounts that do not exist in the provided balance sheet (45% unknown account errors).</li>
<li class="">A meaningful fraction of transactions that successfully compile are semantically wrong — the models' favourite trick is to describe decreasing a liability as "selling your debt," which increases cash but for the wrong reason.</li>
<li class="">GPT-4o used as an automated judge failed to flag inconsistencies in all 10 nonsensical scenarios it was shown, confirming that LLM self-evaluation is not a reliable quality gate for accounting outputs.</li>
<li class="">Models largely copy the input/output example in the prompt rather than generalising: the 7 correct pairs closely resemble the provided example transaction structure.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-holds-up--and-what-doesnt">What holds up — and what doesn't<a href="https://beancount.io/bean-labs/research-logs/2026/06/23/llm-beancount-dsl-financial-literacy-benchmark#what-holds-up--and-what-doesnt" class="hash-link" aria-label="Direct link to What holds up — and what doesn't" title="Direct link to What holds up — and what doesn't" translate="no">​</a></h2>
<p>The paper's core empirical contribution is solid. The Beancount compiler is an objective, reproducible correctness criterion, and using real company balance sheets rather than toy data adds ecological validity. The hierarchical error taxonomy is thoughtfully designed — stopping evaluation at the first error avoids inflating "partial credit" for garbage outputs.</p>
<p>That said, there are obvious limitations the authors mostly acknowledge. Five ~7B open-weight models from 2023–2024 are a narrow slice of the capability landscape; GPT-4o and Claude were excluded for privacy reasons, which is understandable but means the headline number (2.3% correct) understates the frontier. The financial ratio formulas were deliberately withheld from prompts to test inherent domain knowledge — a methodologically interesting choice, but one that makes the results incomparable to any system that would reasonably include formula documentation. And 300 human-evaluated samples across five models, three ratios, and five companies is modest; the per-model per-ratio cells are too small (12 samples) to draw strong conclusions about variance.</p>
<p>The most interesting methodological gap is the absence of any iterative or feedback-based protocol. No tool-calling, no self-correction, no compiler feedback loop — just one-shot generation. Given that CRITIC (LOG-012) and related work show that tool-interactive refinement substantially improves accuracy on tasks with verifiable outputs, a Beancount-compiler-in-the-loop experiment would have been far more informative about deployability.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-this-matters-for-finance-ai">Why this matters for finance AI<a href="https://beancount.io/bean-labs/research-logs/2026/06/23/llm-beancount-dsl-financial-literacy-benchmark#why-this-matters-for-finance-ai" class="hash-link" aria-label="Direct link to Why this matters for finance AI" title="Direct link to Why this matters for finance AI" translate="no">​</a></h2>
<p>Every design decision for the Bean Labs write-back agent rests on assumptions about what LLMs can do with Beancount DSL. This paper is the first empirical anchor. The headline findings are sobering but also interpretable in a useful way.</p>
<p>First, the failure modes are specific, not random. Balance errors and unknown accounts are the two dominant problems, and both are addressable with a compiler-in-the-loop feedback loop: the Beancount compiler tells you exactly which account is unknown and whether the transaction balances. An agent architecture that iterates on compiler output — rather than generating once and stopping — should substantially outperform the one-shot results here. Second, syntax is free. Models have clearly learned the Beancount surface grammar; they just cannot reliably translate financial intent into correct account movements. That distinction matters for where to invest in prompting and fine-tuning. Third, the finding that GPT-4o cannot evaluate accounting quality automatically raises the bar for any automated verification system: you need the compiler, plus domain-expert spot checks, not an LLM critic.</p>
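<p>That feedback loop is simple to sketch. Assuming an injected <code>generate</code> function (the LLM call) and a <code>compile_check</code> function (in practice a wrapper around the Beancount loader's error output), the agent retries until the compiler reports no errors. This is my sketch of the missing protocol, not an implementation from the paper:</p>

```python
def compile_in_the_loop(generate, compile_check, max_rounds=3):
    """Generate a transaction, run the compiler, feed error messages back.

    `generate(errors)` returns candidate Beancount text (`errors` is empty
    on the first round); `compile_check(text)` returns a list of compiler
    error strings, empty when the text compiles. Both are injected stubs
    here; nothing below is specific to any one LLM or compiler binding.
    """
    errors = []
    text = ""
    for _ in range(max_rounds):
        text = generate(errors)
        errors = compile_check(text)
        if not errors:
            return text, True   # compiled cleanly
    return text, False          # still failing; defer to human review
```

<p>This loop catches exactly the compiler-detectable failure modes the paper found dominant (balance errors, unknown accounts). The "Incorrect | Compiles" category still passes through it untouched, which is why the semantic spot check remains a separate, human stage.</p>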
<p>The paper also confirms something I suspected from the anomaly detection work (LOG-049): LLMs operating over financial transactions compile-and-submit too readily. The "Incorrect | Compiles" category — transactions that pass the syntax check but are semantically wrong — is exactly the failure mode a write-back safety guardrail must catch. A transaction can balance perfectly and still book revenue as a liability decrease, which would go undetected by any purely syntactic check.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-to-read-next">What to read next<a href="https://beancount.io/bean-labs/research-logs/2026/06/23/llm-beancount-dsl-financial-literacy-benchmark#what-to-read-next" class="hash-link" aria-label="Direct link to What to read next" title="Direct link to What to read next" translate="no">​</a></h2>
<ul>
<li class="">AnoLLM: Large Language Models for Tabular Anomaly Detection (OpenReview:7VkHffT5X2, ICLR 2025) — likelihood-based anomaly scoring as an alternative to the batch-detection approach; combines naturally with a Beancount compiler signal to flag structurally valid but statistically anomalous entries.</li>
<li class="">ReDAct: Uncertainty-Aware Deferral for LLM Agents (arXiv:2604.07036) — routes low-confidence decisions to a larger model or human; directly addresses the question of when a Beancount write-back agent should defer to human review rather than proceeding after a compiler-feedback loop.</li>
<li class="">CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing (arXiv:2305.11738, ICLR 2024) — the most relevant existing work for building a compiler-in-the-loop correction agent on top of the architecture this paper evaluates.</li>
</ul>]]></content>
        <author>
            <name>Mike Thrift</name>
        </author>
        <category label="LLM" term="LLM"/>
        <category label="Beancount" term="Beancount"/>
        <category label="Plain-Text Accounting" term="Plain-Text Accounting"/>
        <category label="AI" term="AI"/>
        <category label="Machine Learning" term="Machine Learning"/>
        <category label="Financial Literacy" term="Financial Literacy"/>
        <category label="Double-Entry" term="Double-Entry"/>
        <category label="Transaction Validation" term="Transaction Validation"/>
    </entry>
</feed>