GAIA benchmarks 466 real-world tasks across three difficulty levels; frontier agents reached 74.55% in mid-2026 versus 92% for humans, and the remaining Level 3 gap maps directly to the multi-step coordination challenges in automated Beancount ledger workflows.
GPT-4 completes only 14.41% of WebArena's 812 realistic web tasks while humans reach 78.24%; the dominant failure mode is false infeasibility — conservative refusal to act — with direct implications for any agent operating Fava or finance web UIs.
WorkArena benchmarks LLM web agents on 33 real ServiceNow tasks — GPT-4o reaches 42.7% overall but 0% on list-filter tasks, exposing a hard wall between form-filling and structured UI interaction that maps directly to challenges in Beancount ledger automation.
τ-bench shows that top LLMs like Claude 3.5 Sonnet drop from pass^1 of 0.692 to pass^4 of 0.462 on retail customer-service tasks — pass^k requires all k independent trials to succeed, so this is a consistency cliff with direct implications for any write-back agent operating on a Beancount ledger.
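The pass^k metric can be estimated from repeated trials per task with a simple combinatorial formula; the sketch below uses illustrative success counts, not numbers from the τ-bench paper:

```python
from math import comb

def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Unbiased estimate of pass^k: the probability that k attempts
    drawn without replacement from `trials` runs are all successes."""
    if trials < k:
        raise ValueError("need at least k trials per task")
    return comb(successes, k) / comb(trials, k)

# One (successes, trials) pair per task; values are made up for illustration.
tasks = [(8, 8), (4, 8), (6, 8)]
p1 = sum(pass_hat_k(c, n, 1) for c, n in tasks) / len(tasks)
p4 = sum(pass_hat_k(c, n, 4) for c, n in tasks) / len(tasks)
# p4 < p1: any task the agent solves inconsistently drags pass^k down fast.
```

Because `comb(c, k)` is zero whenever `c < k`, a task solved fewer than k times out of n contributes little or nothing to pass^k, which is exactly why the metric falls so sharply for inconsistent agents.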
Chain-of-Table (ICLR 2024) improves LLM tabular reasoning by evolving the table itself as the intermediate state — achieving 67.31% on WikiTQ vs. 61.48% for prior baselines, with a +10.25 point advantage on tables exceeding 4,000 tokens and direct applicability to Beancount ledger query agents.
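The core Chain-of-Table idea is that the intermediate reasoning state is the table itself, shrunk step by step by atomic operations. In the paper an LLM plans each next operation; the toy sketch below hard-codes the chain and uses a hypothetical mini-ledger to show the mechanics:

```python
# Each operation takes a table (list of row dicts) and returns a new,
# smaller table; the chain plays the role of the LLM-planned program.
Table = list[dict]

def select_rows(table: Table, col: str, value) -> Table:
    return [row for row in table if row[col] == value]

def select_columns(table: Table, cols: list[str]) -> Table:
    return [{c: row[c] for c in cols} for row in table]

ledger = [
    {"account": "Expenses:Food", "month": "2024-01", "amount": 120.0},
    {"account": "Expenses:Rent", "month": "2024-01", "amount": 900.0},
    {"account": "Expenses:Food", "month": "2024-02", "amount": 95.0},
]

# Question: "How much was spent on food in 2024-01?"
chain = [
    (select_rows, ("account", "Expenses:Food")),
    (select_rows, ("month", "2024-01")),
    (select_columns, (["amount"],)),
]
table = ledger
for op, args in chain:
    table = op(table, *args)  # the evolving table IS the reasoning trace
result = table[0]["amount"]  # → 120.0
```

Each intermediate table is re-serialized into the next prompt in the real system, which is why the method's advantage grows on long tables: the model only ever reasons over the reduced view.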
TableLlama fine-tunes Llama 2 (7B) on 2.6M table-task examples and beats GPT-4 on structural tasks like column type annotation (F1 94 vs 32), but falls 33 points short on WikiTQ compositional reasoning — a calibrated picture of what 7B open models can and cannot do in finance AI today.
TAPAS (Google Research, ACL 2020) answers table questions by selecting cells and applying scalar aggregations — no SQL generated. This post analyzes the architecture, its 12-point SQA accuracy gain, and why the cell-selection paradigm fits small Beancount ledger queries but breaks down at scale.
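TAPAS's output space is just two things: a set of selected cells and one aggregation operator. The sketch below hard-codes the selection step (in TAPAS it comes from per-cell probabilities produced by a BERT encoder over the flattened table) and shows how the final answer is computed without any SQL string:

```python
from statistics import mean

# The four aggregation operators TAPAS chooses among; NONE means
# "the answer is the selected cell itself".
AGGREGATIONS = {
    "NONE": lambda cells: cells[0],
    "SUM": sum,
    "COUNT": len,
    "AVERAGE": mean,
}

def tapas_answer(selected_cells: list[float], op: str):
    """Apply the predicted aggregation to the predicted cells."""
    return AGGREGATIONS[op](selected_cells)

# "Total spent on groceries": the model selects the grocery amounts
# (values here are illustrative) and predicts op = SUM.
total = tapas_answer([42.5, 17.0, 8.25], "SUM")  # → 67.75
```

This closed operator set is also the paradigm's ceiling: anything needing joins, nesting, or arithmetic beyond one scalar aggregation simply has no expressible answer, which is the scale limit the post discusses.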
MAC-SQL (COLING 2025) uses three specialized agents — Selector for schema reduction, Decomposer for question decomposition, and Refiner for execution-guided SQL correction — to reach 59.59% execution accuracy on the BIRD benchmark; ablation shows the Refiner contributes the most (+4.63 points), with direct implications for Beancount ledger query generation.
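The three MAC-SQL roles compose into a simple loop, with the Refiner's execution-guided retry being the part the ablation credits most. Below is a minimal sketch against SQLite in which each agent is a stub function (in MAC-SQL each is an LLM call); the schema, column names, and the deliberate mistake are all hypothetical:

```python
import sqlite3

def selector(schema: dict[str, list[str]], question: str) -> dict:
    # Stub Selector: keep only tables named in the question; the real
    # agent prunes irrelevant tables/columns to shrink the prompt.
    return {t: cols for t, cols in schema.items() if t in question.lower()}

def decomposer(question: str) -> str:
    # Stub Decomposer: emits SQL with a deliberately wrong column name
    # so the Refiner has something to fix.
    return "SELECT SUM(amount) FROM postings"

def refiner(sql: str, error: str) -> str:
    # Stub Refiner: a real Refiner re-prompts the LLM with the DB error.
    return sql.replace("amount", "number")

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE postings (account TEXT, number REAL)")
db.executemany("INSERT INTO postings VALUES (?, ?)",
               [("Expenses:Food", 12.0), ("Expenses:Food", 8.0)])

question = "total of postings"
schema = selector({"postings": ["account", "number"],
                   "prices": ["date", "price"]}, question)
sql = decomposer(question)
for _ in range(3):  # bounded execution-guided repair loop
    try:
        result = db.execute(sql).fetchone()[0]
        break
    except sqlite3.OperationalError as err:
        sql = refiner(sql, str(err))
```

The same loop transfers cleanly to Beancount: run the candidate BQL, feed the parser or execution error back, and retry a bounded number of times.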
DIN-SQL (NeurIPS 2023) decomposes text-to-SQL into schema linking, complexity classification, and SQL generation stages, lifting GPT-4 from 67.4% to 85.3% execution accuracy on Spider without fine-tuning — and the same decomposition strategy maps directly onto natural language interfaces for Beancount's BQL query language.
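DIN-SQL's pipeline is a fixed sequence of prompts: link schema elements, classify query difficulty, then generate with a difficulty-specific prompt. The sketch below retargets that shape at BQL with stub heuristics standing in for each LLM stage; the routing keywords and column list are illustrative assumptions:

```python
def schema_link(question: str, columns: list[str]) -> list[str]:
    # Stage 1 stub: keep columns literally mentioned in the question;
    # the real stage is a few-shot prompt over the full schema.
    return [c for c in columns if c in question.lower()]

def classify(question: str) -> str:
    # Stage 2 stub: route to an "easy" or "nested" generation prompt.
    return "nested" if ("each" in question or "per" in question) else "easy"

def generate(links: list[str], difficulty: str) -> str:
    # Stage 3 stub: emit BQL; DIN-SQL uses a separate prompt per class.
    select = ", ".join(links + ["sum(position)"])
    tail = " GROUP BY " + ", ".join(links) if difficulty == "nested" and links else ""
    return f"SELECT {select}{tail}"

question = "total per account"
links = schema_link(question, ["account", "date", "position"])
bql = generate(links, classify(question))
# → "SELECT account, sum(position) GROUP BY account"
```

Splitting the task this way is exactly why it transfers to BQL: schema linking maps to Beancount's fixed set of targets (account, date, position, ...), and the difficulty classifier decides whether grouping or subfiltering is needed before any query text is written.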
The BIRD benchmark (NeurIPS 2023) tests LLMs on 95 real databases — GPT-4 reaches only 54.89% execution accuracy with domain hints and 34.88% without, a 20-point gap that directly shapes what a natural-language BQL interface for Beancount would need to solve.