TheAgentCompany: Benchmarking LLM Agents on Real-World Enterprise Tasks
TheAgentCompany evaluates LLM agents on 175 real-world workplace tasks set in a simulated company intranet running GitLab, OwnCloud, and RocketChat. The best model (Gemini-2.5-Pro) completes only 30% of the tasks, at a cost of $4 per task, which indicates that autonomous agents remain far from viable for accounting and finance workflows.
