ReDAct runs a small model by default and escalates to an expensive model only when token-level perplexity signals uncertainty, achieving 64% cost savings over GPT-5.2-only while matching or exceeding its accuracy — a directly applicable pattern for Beancount transaction-categorization agents.
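The routing idea reduces to a few lines: score the cheap model's own completion by perplexity and only re-run on the expensive model when that score crosses a threshold. A minimal sketch, where the threshold value and the `route` function are illustrative assumptions, not details from the ReDAct paper:

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a completion from its per-token log-probabilities."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

def route(token_logprobs, threshold=4.0):
    """Keep the cheap model's answer when it is confident, else escalate.

    The threshold here is a tunable placeholder; in practice it would be
    calibrated on held-out traffic against a cost/accuracy target.
    """
    return "escalate" if perplexity(token_logprobs) > threshold else "small"
```

Confident completions (log-probs near zero) stay on the small model; diffuse ones trigger the escalation path.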
OpenHands is an MIT-licensed, Docker-sandboxed agent platform where CodeAct achieves 26% on SWE-Bench Lite — a sobering result that establishes what AI agents can reliably do today, and why the first productive finance deployments should be tightly scoped rather than autonomous.
FinDER benchmarks RAG on 5,703 real hedge fund analyst queries against S&P 500 10-K filings; E5-Mistral achieves only 25.95% context recall, and abbreviation-heavy queries cost 8.2 precision points — evidence that query normalization, not better embeddings, is the first fix for finance AI pipelines.
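Query normalization of the kind the finding suggests can be as simple as an abbreviation-expansion pass before the retriever sees the query. A sketch under assumed inputs (the abbreviation table below is illustrative, not FinDER's vocabulary):

```python
import re

# Illustrative finance-abbreviation table; a real deployment would use
# a larger, domain-curated mapping.
ABBREVIATIONS = {
    "capex": "capital expenditures",
    "fcf": "free cash flow",
    "yoy": "year over year",
}

def normalize_query(query, table=ABBREVIATIONS):
    """Expand known abbreviations before the query hits the embedder."""
    def expand(match):
        word = match.group(0)
        return table.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", expand, query)
```

Unknown tokens (tickers, plain words) pass through untouched, so the pass is safe to run on every query.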
CausalTAD improves LLM-based tabular anomaly detection by reordering table columns to respect causal dependencies before serialization, lifting average AUC-ROC from 0.803 to 0.834 over AnoLLM on mixed-type benchmarks — with direct implications for detecting anomalies in structured ledger data.
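The reordering step amounts to a topological sort of columns over their dependency graph, then serializing each row in that order. A minimal sketch, assuming the dependency graph is given (CausalTAD infers it from data):

```python
from graphlib import TopologicalSorter

def causal_column_order(parents):
    """Order columns so that causes precede their effects.

    `parents` maps each column to the columns it depends on; the
    example graph in the test below is an illustrative assumption.
    """
    return list(TopologicalSorter(parents).static_order())

def serialize_row(row, order):
    """Serialize one ledger row in causal order for LLM likelihood scoring."""
    return ", ".join(f"{col} is {row[col]}" for col in order)
```

With causes serialized first, each column's value is conditioned on its parents during autoregressive scoring, which is the paper's core intuition.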
AnoLLM (ICLR 2025) reformulates tabular anomaly detection as LLM density estimation — fine-tuning on normal rows and scoring by negative log-likelihood. It outperforms classical methods on mixed-type fraud datasets but offers no edge on purely numerical data, with real implications for detecting anomalies in Beancount ledger entries.
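Once a model is fine-tuned on normal rows, the scoring side is plain arithmetic: a row's anomaly score is its mean negative log-likelihood, and the rarest fraction gets flagged. A sketch assuming per-token log-probabilities are already available from the model:

```python
def anomaly_scores(row_logprobs):
    """Mean negative log-likelihood per serialized row (higher = rarer)."""
    return [-sum(lps) / len(lps) for lps in row_logprobs]

def flag_anomalies(row_logprobs, contamination=0.05):
    """Flag the top `contamination` fraction of rows by NLL.

    The contamination rate is an operator-chosen assumption, not a
    value from the AnoLLM paper.
    """
    scores = anomaly_scores(row_logprobs)
    k = max(1, int(len(scores) * contamination))
    cutoff = sorted(scores, reverse=True)[k - 1]
    return [s >= cutoff for s in scores]
```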
The LLMFinLiteracy benchmark finds that five open-weight ~7B models generate fully correct Beancount transactions only 2.3% of the time, with failures concentrated in accounting reasoning — not syntax — pointing to compiler-in-the-loop feedback as the critical missing ingredient for reliable write-back agents.
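Compiler-in-the-loop feedback is a small control loop: generate, run the compiler, and re-prompt with the error list until the output parses. A sketch where `generate` stands in for a hypothetical LLM call and `validate` for a compiler pass (for Beancount, e.g. collecting the errors list returned by `beancount.loader.load_string`); both are stubbed wiring, not the benchmark's code:

```python
def generate_with_feedback(generate, validate, prompt, max_rounds=3):
    """Regenerate a candidate ledger entry until it compiles cleanly.

    `generate(prompt)` returns candidate ledger text; `validate(text)`
    returns a list of compiler error messages (empty means valid).
    """
    text = generate(prompt)
    for _ in range(max_rounds):
        errors = validate(text)
        if not errors:
            return text
        text = generate(prompt + "\nFix these errors:\n" + "\n".join(errors))
    return text
```

Because the failures are in accounting reasoning rather than syntax, the validator should check balance assertions too, not just parseability.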
TableMaster is a prompting-only pipeline that reaches 78.13% on WikiTQ with GPT-4o-mini — 13 points above Chain-of-Table — by combining table-of-focus extraction, semantic verbalization, and adaptive switching between text and symbolic reasoning. Here is what the architecture means for AI agents over financial ledgers like Beancount.
GPT-4 achieves 74.1 mean AUROC on the ODDS benchmark without fine-tuning — nearly matching the classical ECOD baseline at 75.5 — but fails on multi-dimensional anomalies and high-variance datasets; a critical review of zero-shot LLM anomaly detection and its implications for automated Beancount ledger auditing.
DocFinQA replaces FinQA's curated 700-word passages with full 123,000-word SEC filings, exposing a 175× context increase that nearly halves GPT-4 accuracy on long documents. Retrieval pipelines fail to surface the right chunk 45% of the time at HR@3 — and long-context models are not a substitute.
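The HR@3 figure is worth being able to reproduce on your own pipeline; the metric itself is a few lines. A sketch, with argument names chosen here for illustration:

```python
def hit_rate_at_k(ranked_ids, gold_ids, k=3):
    """Fraction of queries whose gold chunk appears in the top-k results.

    `ranked_ids` holds one ranked list of chunk ids per query;
    `gold_ids` holds the matching correct chunk id for each query.
    """
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_ids, gold_ids))
    return hits / len(gold_ids)
```

A 45% miss rate at k=3 means HR@3 is 0.55: nearly half the time, the answer-bearing chunk never reaches the model at all, which no amount of downstream reasoning can recover.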
τ²-bench extends agent benchmarking to dual-control settings where both the AI and the user invoke tools over shared state — finding that active users cut success rates by 18–25 percentage points, with direct implications for Beancount agents sharing write access with human users.