ReDAct runs a small model by default and escalates to an expensive model only when token-level perplexity signals uncertainty, achieving a 64% cost saving over running GPT-5.2 alone while matching or exceeding its accuracy — a directly applicable pattern for Beancount transaction-categorization agents.
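The escalation gate reduces to a few lines. A minimal sketch, assuming the small model exposes per-token log-probabilities; the threshold value here is an illustrative assumption, not a number from the paper:

```python
import math

# Assumed escalation cutoff; ReDAct's actual threshold is not stated here,
# so tune this on a held-out set of categorization tasks.
PPL_THRESHOLD = 12.0

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-probability
    of the small model's output tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def route(token_logprobs, threshold=PPL_THRESHOLD):
    """Serve the small model's answer when it is confident;
    otherwise escalate the request to the expensive model."""
    return "small" if perplexity(token_logprobs) < threshold else "large"
```

Confident generations (log-probs near zero) stay on the cheap path; uncertain ones trigger the expensive model.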
OpenHands is an MIT-licensed, Docker-sandboxed agent platform where CodeAct achieves 26% on SWE-Bench Lite — a sobering benchmark that establishes what AI agents can reliably do today, and why the first productive finance deployments should be tightly scoped rather than autonomous.
Fin-RATE benchmarks 17 LLMs on 7,500 expert-curated QA pairs from 2,472 SEC filings, revealing an 18.60% accuracy collapse under longitudinal tracking and a 54-point drop for finance-specialized Fin-R1 on cross-entity tasks — with the retrieval pipeline, not the backbone model, as the binding bottleneck.
FinDER benchmarks RAG on 5,703 real hedge fund analyst queries against S&P 500 10-K filings; E5-Mistral achieves only 25.95% context recall, and abbreviation-heavy queries cost 8.2 precision points — evidence that query normalization, not better embeddings, is the first fix for finance AI pipelines.
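Query normalization is the cheapest intervention here: expand abbreviations before embedding so the query vocabulary matches the filings' long-form wording. A minimal sketch; the abbreviation table is a hypothetical stand-in for a curated domain glossary:

```python
# Hypothetical abbreviation table; a production pipeline would curate
# this from a finance glossary (tickers, filing jargon), since FinDER
# shows abbreviation-heavy queries are where precision is lost.
ABBREV = {
    "fcf": "free cash flow",
    "capex": "capital expenditures",
    "yoy": "year over year",
}

def normalize_query(query: str) -> str:
    """Expand known abbreviations token by token before retrieval."""
    return " ".join(ABBREV.get(tok.lower(), tok) for tok in query.split())
```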
The TACL 2024 paper by Liu et al. shows LLMs perform up to 20 points worse on information buried in the middle of long contexts — a U-shaped degradation affecting every tested model including Claude-1.3-100K — with concrete implications for how RAG pipelines should order retrieved passages in finance and accounting applications.
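One practical mitigation, under the assumption that retrieved passages come ranked best-first, is to interleave them so the strongest evidence sits at both ends of the context and the weakest falls in the middle, where the U-shaped degradation hits hardest:

```python
def order_for_context(passages):
    """Interleave a best-first ranked list so top passages land at the
    start and end of the prompt, weakest passages in the middle."""
    front, back = [], []
    for i, passage in enumerate(passages):  # input assumed sorted best-first
        (front if i % 2 == 0 else back).append(passage)
    return front + back[::-1]
```

For five passages ranked 1 (best) to 5 (worst), this yields the order 1, 3, 5, 4, 2: the two best bracket the context.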
AD-LLM benchmarks GPT-4o and Llama 3.1 8B across three anomaly detection roles — zero-shot detector, data augmenter, and model selector — on five NLP datasets; GPT-4o reaches AUROC 0.93–0.99 zero-shot, but LLM-based model selection remains unreliable, with direct implications for financial audit AI.
CausalTAD improves LLM-based tabular anomaly detection by reordering table columns to respect causal dependencies before serialization, lifting average AUC-ROC from 0.803 to 0.834 over AnoLLM on mixed-type benchmarks — with direct implications for detecting anomalies in structured ledger data.
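The reordering idea amounts to serializing columns causes-first, so each value is conditioned on its causes when the LLM scores the row left-to-right. A minimal sketch; the column set and its causal order are illustrative assumptions for a ledger-like table, not CausalTAD's discovered graph:

```python
# Assumed causal order for a ledger-like table: causes before effects
# (the date and payee determine the account, which determines the amount).
CAUSAL_ORDER = ["date", "payee", "account", "amount"]

def serialize_row(row: dict, order=CAUSAL_ORDER) -> str:
    """Serialize columns in causal order before handing the row
    to the LLM for density scoring."""
    return ", ".join(f"{col} is {row[col]}" for col in order)
```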
AnoLLM (ICLR 2025) reformulates tabular anomaly detection as LLM density estimation — fine-tuning on normal rows and scoring by negative log-likelihood. It outperforms classical methods on mixed-type fraud datasets but offers no edge on purely numerical data, with real implications for detecting anomalies in Beancount ledger entries.
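The recipe (fit on normal rows only, score by length-normalized negative log-likelihood) can be illustrated with a toy per-column frequency model standing in for the fine-tuned LLM:

```python
import math
from collections import Counter

class ToyDensityScorer:
    """Toy stand-in for AnoLLM's fine-tuned LLM: independent per-column
    categorical frequencies learned from normal rows only."""

    def fit(self, rows):
        self.counts = [Counter(col) for col in zip(*rows)]
        self.n = len(rows)
        return self

    def score(self, row):
        """Length-normalized negative log-likelihood (Laplace-smoothed);
        higher means less like the normal data, i.e. more anomalous."""
        return -sum(
            math.log((c[v] + 1) / (self.n + len(c) + 1))
            for c, v in zip(self.counts, row)
        ) / len(row)
```

A row with values never seen during fitting scores strictly higher than one drawn from the training distribution, which is the property the NLL formulation buys.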
The LLMFinLiteracy benchmark finds that five open-weight ~7B models generate fully correct Beancount transactions only 2.3% of the time, with failures concentrated in accounting reasoning—not syntax—pointing to compiler-in-the-loop feedback as the critical missing ingredient for reliable write-back agents.
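The compiler-in-the-loop pattern is a regenerate-until-valid loop. A minimal sketch in which `validate` is a toy stand-in for shelling out to `bean-check` (a real loop would parse its error output) and `llm_draft` is a hypothetical LLM call that receives the previous round's errors:

```python
import re

# First line of a Beancount transaction: date, flag, then payee string.
DATE_RE = re.compile(r'^\d{4}-\d{2}-\d{2} [*!] ')

def validate(entry: str) -> list:
    """Toy checks standing in for bean-check: header format and
    balanced postings."""
    errors = []
    lines = entry.strip().splitlines()
    if not DATE_RE.match(lines[0]):
        errors.append('first line must be \'YYYY-MM-DD * "Payee"\'')
    amounts = [float(m.group(1)) for line in lines[1:]
               if (m := re.search(r"(-?\d+\.\d{2}) USD$", line))]
    if round(sum(amounts), 2) != 0.0:
        errors.append("postings do not balance")
    return errors

def write_back(llm_draft, max_retries=3):
    """Regenerate until the checker passes, feeding errors back to the LLM."""
    feedback = []
    for _ in range(max_retries):
        entry = llm_draft(feedback)  # hypothetical LLM call
        feedback = validate(entry)
        if not feedback:
            return entry
    raise ValueError(f"gave up after {max_retries} attempts: {feedback}")
```

Because the benchmark localizes failures in accounting reasoning rather than syntax, the checker's error messages (unbalanced postings, wrong accounts) are exactly the feedback the model lacks on a single pass.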