τ-bench shows that top LLMs are strikingly inconsistent: Claude 3.5 Sonnet drops from pass^1 of 0.692 to pass^4 of 0.462 on retail customer-service tasks, where pass^k counts a task as solved only if all k independent trials succeed — a consistency cliff with direct implications for any write-back agent operating on a Beancount ledger.
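The pass^k metric behind that drop fits in a few lines. This is my own sketch, not τ-bench's reference code; it uses the unbiased estimator C(c, k) / C(n, k), averaged over tasks, where c is the number of successes observed in n recorded trials:

```python
from math import comb

def pass_hat_k(successes: list[int], n: int, k: int) -> float:
    """pass^k: probability that all k i.i.d. trials of a task succeed,
    averaged over tasks. successes[i] = successful trials (of n) on task i."""
    return sum(comb(c, k) / comb(n, k) for c in successes) / len(successes)

# Two tasks, 4 trials each: one always passes, one passes half the time.
p1 = pass_hat_k([4, 2], n=4, k=1)  # average single-trial success rate
p4 = pass_hat_k([4, 2], n=4, k=4)  # chance all four trials succeed
```

Note that pass^k is monotonically non-increasing in k, which is why the 0.692 → 0.462 slide is a direct read on reliability rather than capability.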
Chain-of-Table (ICLR 2024) improves LLM tabular reasoning by evolving the table itself as the intermediate state — achieving 67.31% on WikiTQ vs. 61.48% for prior baselines, with a +10.25 point advantage on tables exceeding 4,000 tokens and direct applicability to Beancount ledger query agents.
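The core loop is easy to picture: the table itself is the scratchpad, transformed one operation at a time. A minimal sketch, with a fixed operation plan standing in for the step-by-step LLM planner the paper uses (all function names here are illustrative, not from the paper's code):

```python
# Chain-of-Table, sketched: the evolving table IS the chain of thought.
Table = list[dict]

def select_rows(t: Table, pred) -> Table:
    """Keep only rows matching a predicate (one of the paper's atomic ops)."""
    return [r for r in t if pred(r)]

def add_column(t: Table, name: str, fn) -> Table:
    """Derive a new column from each row (another atomic op)."""
    return [{**r, name: fn(r)} for r in t]

def chain_of_table(table: Table, plan: list) -> Table:
    # In the paper the LLM chooses each next op conditioned on the
    # current (progressively smaller) table; here the plan is fixed.
    for op, *args in plan:
        table = op(table, *args)
    return table

ledger = [
    {"account": "Expenses:Food", "amount": 12.5},
    {"account": "Expenses:Rent", "amount": 900.0},
    {"account": "Income:Salary", "amount": -3000.0},
]
plan = [
    (select_rows, lambda r: r["account"].startswith("Expenses")),
    (add_column, "usd", lambda r: f'{r["amount"]:.2f} USD'),
]
final = chain_of_table(ledger, plan)
```

The design insight the WikiTQ numbers support: shrinking the table before answering is exactly what helps most on long (4,000+ token) inputs.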
TableLlama fine-tunes Llama 2 (7B) on 2.6M table-task examples and beats GPT-4 on structural tasks like column type annotation (F1 94 vs. 32), yet trails it by 33 points on WikiTQ compositional reasoning — a calibrated picture of what 7B open models can and cannot do in finance AI today.
TAPAS (Google Research, ACL 2020) answers table questions by selecting cells and applying scalar aggregations — no SQL generated. This post analyzes the architecture, its 12-point SQA accuracy gain, and why the cell-selection paradigm fits small Beancount ledger queries but breaks down at scale.
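TAPAS's output head is just selected cells plus one of a small set of scalar aggregation operators. A sketch of that final step (the operator names follow the paper's NONE/COUNT/SUM/AVERAGE set; the function itself is hypothetical):

```python
def tapas_answer(cells: list[float], op: str):
    """Apply a TAPAS-style aggregation to the model's selected cells.
    No SQL is ever generated; this is the entire answer-producing step."""
    if op == "NONE":      # answer is the selected cell(s) verbatim
        return cells
    if op == "COUNT":
        return len(cells)
    if op == "SUM":
        return sum(cells)
    if op == "AVERAGE":
        return sum(cells) / len(cells)
    raise ValueError(f"unknown aggregation op: {op}")

# "Total of the two selected expense amounts?"
total = tapas_answer([12.5, 900.0], "SUM")
```

The closed operator set is also the scaling limit the post discusses: anything needing joins, nesting, or arithmetic between columns falls outside what cell selection plus one scalar op can express.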
MAC-SQL (COLING 2025) uses three specialized agents — Selector for schema reduction, Decomposer for question decomposition, and Refiner for execution-guided SQL correction — to reach 59.59% execution accuracy on the BIRD benchmark; ablation shows the Refiner contributes the most (+4.63 points), with direct implications for Beancount ledger query generation.
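The Refiner's execution-guided retry loop is the piece worth internalizing, given it carries the largest ablation gain. A minimal sqlite3 sketch, with a fixed candidate list standing in for the LLM's error-conditioned repairs:

```python
import sqlite3

def refiner(conn: sqlite3.Connection, candidate_sqls: list[str]):
    """MAC-SQL's Refiner, sketched: execute a candidate and, on a database
    error, fall through to the next (LLM-repaired) candidate."""
    for sql in candidate_sqls:
        try:
            return conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # in MAC-SQL the error message is fed back to the LLM
    raise RuntimeError("no candidate executed successfully")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE postings (account TEXT, amount REAL)")
conn.executemany("INSERT INTO postings VALUES (?, ?)",
                 [("Expenses:Food", 12.5), ("Expenses:Rent", 900.0)])

# First candidate misspells a column; the "repaired" second one succeeds.
rows = refiner(conn, ["SELECT SUM(amnt) FROM postings",
                      "SELECT SUM(amount) FROM postings"])
```

The same loop transfers directly to BQL generation: Beancount's query engine also returns structured errors that a repair prompt can condition on.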
DIN-SQL (NeurIPS 2023) decomposes text-to-SQL into schema linking, complexity classification, and SQL generation stages, lifting GPT-4 from 67.4% to 85.3% execution accuracy on Spider without fine-tuning — and the same decomposition strategy maps directly onto natural language interfaces for Beancount's BQL query language.
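The first two stages are cheap to approximate. This sketch replaces DIN-SQL's prompted LLM calls with keyword heuristics purely to show the shape of the pipeline; every name and rule here is illustrative, not the paper's:

```python
def schema_link(question: str, schema: dict[str, list[str]]) -> dict[str, list[str]]:
    """Stage 1, schema linking: keep only columns the question mentions.
    (DIN-SQL does this with a prompted LLM; this is a keyword stand-in.)"""
    words = set(question.lower().replace("?", "").split())
    return {t: [c for c in cols if c in words] for t, cols in schema.items()}

def classify(question: str) -> str:
    """Stage 2: route to an easy / non-nested / nested generation prompt."""
    q = question.lower()
    if "than the average" in q or "at least one" in q:
        return "nested"      # needs a subquery
    if any(w in q for w in ("each", "per", "total")):
        return "non-nested"  # needs joins/grouping but no subquery
    return "easy"

linked = schema_link("What is the total amount per account?",
                     {"postings": ["account", "amount", "flag"]})
route = classify("What is the total amount per account?")
```

Stage 3 then generates SQL (or, for the Beancount case, BQL) with a prompt specialized to the predicted difficulty class, which is where the 67.4% → 85.3% lift comes from.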
The BIRD benchmark (NeurIPS 2023) tests LLMs on 95 real databases — GPT-4 reaches only 54.89% execution accuracy with domain hints and 34.88% without, a 20-point gap that directly shapes what a natural-language BQL interface for Beancount would need to solve.
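Execution accuracy, the metric those numbers are measured in, is simple to reproduce: a prediction scores only if its result set matches the gold query's on the live database. A sqlite3 sketch (the toy schema is mine, not BIRD's):

```python
import sqlite3

def execution_accuracy(conn: sqlite3.Connection,
                       pairs: list[tuple[str, str]]) -> float:
    """Fraction of (predicted, gold) SQL pairs whose result sets match."""
    hits = 0
    for pred, gold in pairs:
        try:
            ok = set(conn.execute(pred)) == set(conn.execute(gold))
        except sqlite3.Error:
            ok = False  # queries that fail to execute score zero
        hits += ok
    return hits / len(pairs)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE postings (account TEXT, amount REAL)")
conn.executemany("INSERT INTO postings VALUES (?, ?)",
                 [("Expenses:Food", 12.5), ("Expenses:Rent", 900.0)])

acc = execution_accuracy(conn, [
    ("SELECT SUM(amount) FROM postings", "SELECT 912.5"),  # result sets match
    ("SELECT COUNT(*) FROM postings",                      # result sets differ
     "SELECT SUM(amount) FROM postings"),
])
```

Set comparison also explains a known leniency of the metric: a wrong query that coincidentally returns the right rows still counts as correct.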
CMU and NC State researchers propose using System-Theoretic Process Analysis (STPA) and a capability-enhanced Model Context Protocol to derive formal safety specifications for LLM agent tool use, with Alloy-based verification demonstrating absence of unsafe flows in a calendar scheduling case study.
Microsoft's GraphRAG builds a Leiden-partitioned entity graph over a text corpus and precomputes community summaries to answer global sensemaking questions that standard vector RAG cannot handle — but a 2025 bias audit shows its 72–83% win rates collapse after correcting for position and length artifacts in LLM-as-judge evaluation.
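GraphRAG's global search is a map-reduce over those precomputed community summaries. This toy sketch stubs both LLM calls — relevance scoring becomes keyword overlap, reduction becomes concatenation — just to show the control flow:

```python
def global_query(question: str, community_summaries: list[str], k: int = 2) -> str:
    """Answer a corpus-level question from community summaries.
    'Map': score each community for relevance (LLM in GraphRAG, keyword
    overlap here). 'Reduce': combine the top partial answers (LLM in
    GraphRAG, string join here)."""
    q = set(question.lower().split())
    scored = sorted(community_summaries,
                    key=lambda s: len(q & set(s.lower().split())),
                    reverse=True)
    partials = scored[:k]        # map step: one partial answer per community
    return " | ".join(partials)  # reduce step: merge into a global answer

summaries = [
    "Community about Beancount expense accounts and budgets",
    "Community about salary income postings",
    "Community about SQL generation",
]
answer = global_query("What do the expense accounts cover?", summaries, k=1)
```

The bias audit's point lands precisely on the reduce step: when the judge comparing these merged answers against vector-RAG answers is itself an LLM, position and length artifacts can manufacture most of the reported win rate.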
FinAuditing tests 13 LLMs zero-shot on 1,102 real SEC XBRL filing instances; top scores are 13.86% on financial math verification and 12.42% on concept retrieval — results that directly bound what AI accounting tools can be trusted to automate without external tooling.