ReDAct runs a small model by default and escalates to an expensive model only when token-level perplexity signals uncertainty, achieving 64% cost savings over GPT-5.2-only while matching or exceeding its accuracy — a directly applicable pattern for Beancount transaction-categorization agents.
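The escalation rule described above can be sketched in a few lines. This is an illustrative sketch, not the ReDAct implementation: `call_small`, `call_large`, and the threshold value are hypothetical stand-ins, and the real system would tune the threshold on held-out data.

```python
# Hypothetical router sketch: answer with the cheap model first, and
# escalate only when the mean token log-probability (a perplexity
# proxy) signals uncertainty. call_small/call_large are placeholder
# callables, not any real API.

def mean_logprob(token_logprobs):
    """Average log-probability across the generated tokens."""
    return sum(token_logprobs) / len(token_logprobs)

def route(prompt, call_small, call_large, threshold=-1.5):
    """Return (answer, which_model). call_small yields (text, logprobs)."""
    text, logprobs = call_small(prompt)
    if mean_logprob(logprobs) < threshold:   # low confidence -> escalate
        return call_large(prompt), "large"
    return text, "small"
```

Cost savings then come from how often the small model clears the threshold; the 64% figure above implies roughly two-thirds of queries never touch the expensive model.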
OpenHands is an MIT-licensed, Docker-sandboxed agent platform where CodeAct achieves 26% on SWE-Bench Lite — a sobering benchmark that establishes what AI agents can reliably do today, and why the first productive finance deployments should be tightly scoped rather than autonomous.
The LLMFinLiteracy benchmark finds that five open-weight ~7B models generate fully correct Beancount transactions only 2.3% of the time, with failures concentrated in accounting reasoning—not syntax—pointing to compiler-in-the-loop feedback as the critical missing ingredient for reliable write-back agents.
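The compiler-in-the-loop idea above can be sketched as a generate-check-repair loop. The helper names here are hypothetical; the checker is abstracted as a function returning error strings, which for Beancount would naturally be backed by `beancount.loader.load_string`, whose second return value is the list of parse and balance errors.

```python
# Compiler-in-the-loop sketch (illustrative; generate/check are
# placeholder callables). The checker's error messages are fed back
# into the prompt, the way a compiler's diagnostics drive a fix loop.

def repair_loop(prompt, generate, check, max_rounds=3):
    """Regenerate until the ledger checker accepts or rounds run out.

    generate: prompt -> candidate transaction text
    check:    text -> list of error messages ([] means valid)
    """
    text = generate(prompt)
    for _ in range(max_rounds):
        errors = check(text)
        if not errors:
            return text, True
        # Append the checker's diagnostics and try again.
        text = generate(prompt + "\nFix these errors:\n" + "\n".join(errors))
    return text, False
```

Because the benchmark locates failures in accounting reasoning rather than syntax, the feedback that matters most is balance errors (debits vs. credits), which Beancount's loader reports alongside parse errors.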
TableMaster is a prompting-only pipeline that reaches 78.13% on WikiTQ with GPT-4o-mini, 13 points above Chain-of-Table, by combining table-of-focus extraction, semantic verbalization, and adaptive switching between text and symbolic reasoning, an architecture with direct lessons for AI agents over financial ledgers like Beancount.
τ²-bench extends agent benchmarking to dual-control settings where both the AI and the user invoke tools over shared state — finding that active users cut success rates by 18–25 percentage points, with direct implications for Beancount agents sharing write access with human users.
GAIA benchmarks 466 real-world tasks across three difficulty levels; frontier agents reached 74.55% in mid-2026 versus 92% for humans, and the remaining Level 3 gap maps directly to the multi-step coordination challenges in automated Beancount ledger workflows.
WorkArena benchmarks LLM web agents on 33 real ServiceNow tasks — GPT-4o reaches 42.7% overall but 0% on list-filter tasks, exposing a hard wall between form-filling and structured UI interaction that maps directly to challenges in Beancount ledger automation.
τ-bench shows that top LLMs like Claude 3.5 Sonnet drop from pass@1 of 0.692 to pass@4 of 0.462 in retail customer-service tasks — a consistency cliff with direct implications for any write-back agent operating on a Beancount ledger.
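The pass@1-versus-pass@4 gap above is measured with a consistency metric, pass^k: the estimated probability that k independent reruns of the same task all succeed. A standard unbiased estimator from c successes in n trials is C(c, k) / C(n, k), averaged over tasks; this sketch computes it (the function name is mine, not τ-bench's code).

```python
from math import comb

# pass^k estimator sketch: from n trials with c successes per task,
# C(c, k) / C(n, k) estimates the chance that k fresh reruns all
# succeed; averaging over tasks gives the benchmark-level number.

def pass_hat_k(results, k):
    """results: list of (n_trials, n_successes) pairs, one per task."""
    vals = [comb(c, k) / comb(n, k) for n, c in results]
    return sum(vals) / len(vals)
```

Note how quickly the estimate collapses: a task solved in 2 of 4 trials scores 0.5 at k=1 but only 1/6 at k=2, which is why write-back agents on a ledger need consistency guarantees, not just a good single-shot success rate.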
Chain-of-Table (ICLR 2024) improves LLM tabular reasoning by evolving the table itself as the intermediate state — achieving 67.31% on WikiTQ vs. 61.48% for prior baselines, with a +10.25 point advantage on tables exceeding 4,000 tokens and direct applicability to Beancount ledger query agents.
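The core Chain-of-Table move, using the table itself as the intermediate reasoning state, can be sketched with a planned chain of atomic operations. This is an illustrative sketch, not the paper's code: in the real pipeline an LLM plans each operation and the transformed table is re-serialized into the next prompt, while here the chain is given explicitly.

```python
# Chain-of-Table-style sketch: the table (a list of dict rows) is the
# evolving intermediate state, transformed step by step by atomic ops.
# Operation names here are illustrative, not the paper's exact set.

def apply_chain(rows, chain):
    """rows: list of dicts; chain: list of (op_name, arg) pairs."""
    for op, arg in chain:
        if op == "filter_rows":          # arg: predicate on a row
            rows = [r for r in rows if arg(r)]
        elif op == "select_columns":     # arg: list of column names
            rows = [{k: r[k] for k in arg} for r in rows]
        elif op == "sort_by":            # arg: column name, descending
            rows = sorted(rows, key=lambda r: r[arg], reverse=True)
    return rows
```

For a Beancount query agent the same pattern applies: each step shrinks a large posting table toward the few rows that answer the question, which is exactly where the long-table (+10.25 point) advantage comes from.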
TableLlama fine-tunes Llama 2 (7B) on 2.6M table-task examples and beats GPT-4 on structural tasks like column type annotation (F1 94 vs 32), but falls 33 points short on WikiTQ compositional reasoning — a calibrated benchmark for what 7B open models can and cannot do in finance AI today.