TheAgentCompany tests 175 real workplace tasks across a simulated intranet with GitLab, OwnCloud, and RocketChat. The best model (Gemini-2.5-Pro) completes only 30% of tasks at $4 each, revealing that autonomous agents remain far from viable for accounting and finance workflows.
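Those two numbers compound: at roughly $4 per attempt and a 30% completion rate, the expected spend per successfully finished task is about $13, before counting any human review of the 70% of runs that fail. A minimal sketch of that arithmetic, assuming independent retries (the retry model is an assumption, not something the benchmark measures):

```python
def expected_cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend per completed task, assuming independent retries.

    With independent attempts, expected attempts per success is 1 / p,
    so expected cost is cost_per_attempt / p.
    """
    return cost_per_attempt / success_rate

# Figures from the TheAgentCompany summary above: ~$4 per attempt, 30% success.
print(f"${expected_cost_per_success(4.0, 0.30):.2f} per completed task")  # ~$13.33
```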
WorkArena++ (NeurIPS 2024) benchmarks 682 compositional enterprise tasks across three difficulty levels. GPT-4o solves 2.1% of them while humans solve 93.9%, pinpointing where current AI agents break down on knowledge work with implicit goals, and why that gap matters for autonomous accounting automation.
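One plausible reading of that 2.1%-vs-93.9% gap is multiplicative failure: a compositional task chains many UI steps, and end-to-end success decays geometrically with chain length. The sketch below inverts that relationship under the assumption of independent, equally reliable steps; WorkArena++ does not publish per-step figures, so the step counts here are purely illustrative:

```python
def implied_step_reliability(end_to_end_rate: float, num_steps: int) -> float:
    """Per-step success rate that yields the given end-to-end rate,
    assuming num_steps independent steps that must all succeed."""
    return end_to_end_rate ** (1.0 / num_steps)

# Illustrative step counts, not measurements from the benchmark.
for steps in (5, 10, 20):
    agent = implied_step_reliability(0.021, steps)   # GPT-4o, 2.1% end-to-end
    human = implied_step_reliability(0.939, steps)   # humans, 93.9% end-to-end
    print(f"{steps:>2} steps: agent ~{agent:.1%}/step vs human ~{human:.2%}/step")
```

Under this (assumed) model, even a 10-step task would require ~99.4% per-step reliability to match human performance, which is why small per-step error rates are enough to produce near-zero compositional scores.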
WorkArena benchmarks LLM web agents on 33 real ServiceNow tasks: GPT-4o reaches 42.7% overall but 0% on list-filter tasks, exposing a hard wall between simple form-filling and structured UI interaction, the same divide that Beancount ledger automation runs into.
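To make the Beancount connection concrete: the target of ledger automation is a typed, machine-validated artifact rather than a free-form web form. A minimal sketch using the beancount Python library's `loader.load_string` (present in Beancount v2; the accounts, dates, and amounts are invented for illustration):

```python
from beancount import loader

# A hypothetical ledger fragment an agent would have to produce; the
# account names and figures are illustrative, not from any benchmark.
LEDGER = """
2024-01-15 open Assets:Bank:Checking
2024-01-15 open Expenses:Office:Supplies

2024-01-16 * "Acme Corp" "Printer paper"
  Expenses:Office:Supplies   42.50 USD
  Assets:Bank:Checking      -42.50 USD
"""

# load_string parses and validates the text; errors collects structured
# findings such as unbalanced postings or unknown accounts.
entries, errors, options_map = loader.load_string(LEDGER)
print(f"{len(entries)} entries, {len(errors)} errors")
```

The contrast with the list-filter failure is the point: a Beancount agent gets machine-checkable error feedback on every attempt, while the structured-UI tasks WorkArena probes offer no comparable signal.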