TableMaster is a prompting-only pipeline that reaches 78.13% on WikiTQ with GPT-4o-mini, 13 points above Chain-of-Table, by combining table-of-focus extraction, semantic verbalization, and adaptive switching between text and symbolic reasoning. Here is what that architecture means for AI agents working over plain-text financial ledgers such as those kept in Beancount.
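A minimal sketch of what that three-stage recipe could look like in code, assuming a generic `llm(prompt) -> str` completion callable; the prompts and the inline executor are illustrative stand-ins, not TableMaster's actual templates or implementation.

```python
# Illustrative TableMaster-style pipeline; `llm` is any prompt -> completion
# callable, and the prompts are simplified stand-ins for the paper's templates.
import io
import contextlib
from typing import Callable

def _run_python(code: str) -> str:
    """Execute generated code and capture stdout (demo only; sandbox in real use)."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue().strip()

def answer(question: str, table_md: str, llm: Callable[[str], str]) -> str:
    # 1. Table-of-focus extraction: keep only the rows/columns the question needs.
    focus = llm(
        f"Table:\n{table_md}\n\nQuestion: {question}\n"
        "Return only the relevant rows and columns as a markdown table."
    )
    # 2. Semantic verbalization: restate the sub-table as natural-language facts.
    facts = llm(f"Rewrite this table as short factual sentences:\n{focus}")
    # 3. Adaptive reasoning: route to textual or symbolic (program-based) reasoning.
    route = llm(
        f"Question: {question}\nFacts:\n{facts}\n"
        "Reply TEXT if lookup/reading suffices, SYMBOLIC if arithmetic, counting, "
        "or sorting is required."
    ).strip().upper()
    if route.startswith("SYMBOLIC"):
        code = llm(
            f"Write Python that computes the answer to '{question}' from this "
            f"table and prints only the answer:\n{focus}"
        )
        return _run_python(code)
    return llm(f"Facts:\n{facts}\nQuestion: {question}\nAnswer concisely.")
```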
GPT-4 achieves a mean AUROC of 74.1% on the ODDS benchmark without fine-tuning, nearly matching the classical ECOD baseline at 75.5%, but fails on multi-dimensional anomalies and high-variance datasets. A critical review of zero-shot LLM anomaly detection and its implications for automated Beancount ledger auditing.
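Reproducing the classical side of that comparison on a ledger is cheap. A hedged sketch, assuming the `beancount` and `pyod` packages and a hypothetical `ledger.beancount` file; the single feature (log absolute posting amount) is chosen for illustration, not taken from the review.

```python
# Illustrative sketch: score Beancount postings for outliers with the ECOD
# baseline the review benchmarks GPT-4 against. The feature and the file path
# are assumptions made for this example.
import numpy as np
from beancount import loader
from beancount.core.data import Transaction
from pyod.models.ecod import ECOD

entries, errors, _ = loader.load_file("ledger.beancount")  # hypothetical path

postings, amounts = [], []
for entry in entries:
    if isinstance(entry, Transaction):
        for posting in entry.postings:
            if posting.units is not None and posting.units.number is not None:
                postings.append((entry.date, posting.account, posting.units.number))
                amounts.append(float(abs(posting.units.number)))

X = np.log1p(np.array(amounts)).reshape(-1, 1)  # single toy feature
clf = ECOD()
clf.fit(X)

# Highest decision scores = most anomalous postings under this toy feature.
for idx in np.argsort(clf.decision_scores_)[-5:][::-1]:
    date, account, number = postings[idx]
    print(f"{date} {account} {number}  score={clf.decision_scores_[idx]:.3f}")
```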
DocFinQA replaces FinQA's curated 700-word passages with full 123,000-word SEC filings, a 175× increase in context that nearly halves GPT-4 accuracy on long documents. Retrieval pipelines fail to surface the right chunk within the top three results (HR@3) 45% of the time, and long-context models are not a substitute.
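HR@3 here is hit rate at 3: the fraction of questions whose gold evidence chunk lands among the top three retrieved chunks. A minimal sketch of the metric; the `(gold_id, ranked_ids)` input format is an assumption for the example.

```python
# Minimal sketch of Hit Rate@k: the share of questions whose gold evidence
# chunk id appears among the top-k retrieved chunk ids.
from typing import Hashable, Iterable, Sequence

def hit_rate_at_k(results: Iterable[tuple[Hashable, Sequence[Hashable]]], k: int = 3) -> float:
    results = list(results)
    hits = sum(1 for gold, ranked in results if gold in ranked[:k])
    return hits / len(results) if results else 0.0

# Toy example: the gold chunk is in the top 3 for two of three questions -> HR@3 = 0.67.
demo = [
    ("c17", ["c17", "c02", "c88"]),
    ("c05", ["c11", "c05", "c40"]),
    ("c33", ["c90", "c12", "c77"]),
]
print(f"HR@3 = {hit_rate_at_k(demo, k=3):.2f}")
```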
TheAgentCompany tests 175 real workplace tasks across a simulated intranet with GitLab, OwnCloud, and RocketChat. The best model (Gemini-2.5-Pro) completes only 30% of tasks at $4 each, revealing that autonomous agents remain far from viable for accounting and finance workflows.
τ²-bench extends agent benchmarking to dual-control settings where both the AI and the user invoke tools over shared state — finding that active users cut success rates by 18–25 percentage points, with direct implications for Beancount agents sharing write access with human users.
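The dual-control hazard is easy to picture for a shared ledger: the state the agent observed can be stale by the time it writes. A conceptual sketch, not τ²-bench's actual harness, using an in-memory ledger stand-in with an optimistic version check.

```python
# Conceptual sketch of the dual-control problem: the agent and the user both
# hold write access to shared state, so the agent's last observation can be
# stale by the time it acts. Not τ²-bench's actual environment code.
from dataclasses import dataclass, field

@dataclass
class SharedLedger:
    version: int = 0
    entries: list[str] = field(default_factory=list)

    def append(self, entry: str) -> None:
        self.entries.append(entry)
        self.version += 1

def agent_tool_add(ledger: SharedLedger, entry: str, observed_version: int) -> bool:
    # Optimistic concurrency check: refuse to write if the user changed the
    # ledger since the agent last read it, forcing a re-observation instead.
    if ledger.version != observed_version:
        return False  # stale view; the agent must re-read before acting
    ledger.append(entry)
    return True

ledger = SharedLedger()
seen = ledger.version                        # agent observes the ledger
ledger.append('2024-01-05 * "User edit"')    # user writes concurrently
ok = agent_tool_add(ledger, '2024-01-05 * "Agent edit"', seen)
print(ok)  # False: the agent's plan was based on a stale observation
```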
WorkArena++ (NeurIPS 2024) benchmarks 682 compositional enterprise tasks across three difficulty levels. GPT-4o solves 2.1% of them while humans solve 93.9%, isolating exactly why current AI agents fail at implicit-goal knowledge work and why that gap matters for autonomous accounting automation.
GAIA benchmarks 466 real-world tasks across three difficulty levels; frontier agents reached 74.55% in mid-2026 versus 92% for humans, and the remaining Level 3 gap maps directly to the multi-step coordination challenges in automated Beancount ledger workflows.
OSWorld (NeurIPS 2024) benchmarks multimodal AI agents on 369 real desktop tasks across Ubuntu, Windows, and macOS — finding a 60-percentage-point gap between the best model (12.24%) and human performance (72.36%), with 75% of failures traced to visuomotor grounding errors rather than reasoning failures.
GPT-4 completes only 14.41% of WebArena's 812 realistic web tasks while humans reach 78.24%; the dominant failure mode is false infeasibility — conservative refusal to act — with direct implications for any agent operating Fava or finance web UIs.
WorkArena benchmarks LLM web agents on 33 real ServiceNow tasks. GPT-4o reaches 42.7% overall but 0% on list-filter tasks, exposing a sharp divide between form-filling and structured UI interaction that maps directly to challenges in Beancount ledger automation.