Bean Labs Research Log

AI, LLM, Automation, Reconciliation, Beancount, Cash Flow, Financial Management, Forecasting

Can LLM Agents Be CFOs? EnterpriseArena's 132-Month Simulation Reveals a Wide Gap

EnterpriseArena runs 11 LLMs through a 132-month CFO simulation tracking survival, terminal valuation, and book-closing rates. Only Qwen3.5-9B survives 80% of runs; GPT-5.4 and DeepSeek-V3.1 hit 0%. Human experts achieve 100% survival at 5× the terminal value. The critical bottleneck: LLMs skip ledger reconciliation 80% of the time, acting on stale financial state.

AI, LLM, Automation, Machine Learning, Beancount, Data Science, Technology

WildToolBench: Why No LLM Exceeds 15% Session Accuracy in Real-World Tool Use

WildToolBench (ICLR 2026) evaluates 57 LLMs on 1,024 tasks drawn from real user behavior — no model exceeds 15% session accuracy, with compositional orchestration, hidden intent, and instruction transitions as the three sharpest failure modes.

LLM, AI, Machine Learning, Trust, Finance, Data Science, Hallucination Detection

LLM Confidence and Calibration: A Survey of What the Research Actually Shows

A systematic survey of LLM confidence estimation and calibration methods—white-box logit approaches, consistency-based SelfCheckGPT, and semantic entropy—reveals that verbalized confidence scores from GPT-4 achieve only ~62.7% AUROC, barely above chance, with direct implications for deploying uncertainty-aware agents in finance and accounting.

LLM, AI, Machine Learning, Automation, Beancount, Performance

JSONSchemaBench: Real-World Schema Complexity Breaks LLM Structured Output Guarantees

JSONSchemaBench tests 9,558 real-world JSON schemas against six constrained decoding frameworks and finds that schema complexity causes coverage to collapse from 86% on simple schemas to 3% on complex ones, with XGrammar silently emitting 38 non-compliant outputs and no framework covering all 45 JSON Schema feature categories.

AI, LLM, Automation, Beancount, Fintech, Machine Learning, Reconciliation

FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under MCP

FinMCP-Bench evaluates six LLMs on 613 real-world financial tool-use tasks backed by 65 MCP servers. The best model scores 3.08% exact match on multi-turn tasks, revealing a 20× performance collapse from single-tool to multi-turn scenarios.

LLM, AI, Finance, Fintech, Automation, Beancount, Machine Learning

FinTrace: Trajectory-Level Evaluation of LLM Tool Calling for Financial Tasks

FinTrace benchmarks 13 LLMs on 800 expert-annotated financial task trajectories across 9 metrics, finding that frontier models achieve strong tool selection (F1 ~0.9) but score only 3.23/5 on information utilization — the step where agents reason over what tools return.

AI, LLM, Automation, Machine Learning, Fintech, Beancount, Compliance, Data Science

FinToolBench: Evaluating LLM Agents on Real-World Financial Tool Use

FinToolBench pairs 760 live financial API tools with 295 executable queries to test LLM agents on real-world financial tasks. It finds that GPT-4o's conservative tool invocation rate (TIR) of 22.7% yields higher answer quality (CSS 0.670) than Qwen3-8B's aggressive TIR of 87.1%, while intent mismatch exceeds 50% across all tested models.

AI, Machine Learning, LLM, Finance, Data Science, Beancount, Automation

OmniEval: Omnidirectional RAG Evaluation Benchmark for the Financial Domain

OmniEval (EMNLP 2025) benchmarks RAG systems across 5 task types × 16 financial topics using 11.4k auto-generated test cases. The best systems achieve only 36% numerical accuracy — concrete evidence that RAG pipelines need validation layers before writing to structured financial ledgers.

AI, LLM, Machine Learning, Fraud Detection, Data Science, Beancount, Analytics

LLM Anomaly Detection Survey (NAACL 2025): Strong Taxonomy, Absent Tabular Coverage

A critical reading of Xu and Ding's NAACL 2025 survey on LLM-based anomaly and OOD detection: the detection-vs-generation taxonomy holds up, but the near-total absence of tabular coverage means financial AI practitioners must themselves adapt insights drawn from vision models.

FinRAGBench-V: Multimodal RAG with Visual Citations in the Financial Domain
