OpenHands is an MIT-licensed, Docker-sandboxed agent platform where CodeAct achieves 26% on SWE-Bench Lite — a sobering benchmark that establishes what AI agents can reliably do today, and why the first productive finance deployments should be tightly scoped rather than autonomous.
GPT-4 completes only 14.41% of WebArena's 812 realistic web tasks while humans reach 78.24%; the dominant failure mode is false infeasibility — conservative refusal to act — with direct implications for any agent operating Fava or finance web UIs.
TableLlama fine-tunes Llama 2 (7B) on 2.6M table-task examples and beats GPT-4 on structural tasks like column type annotation (F1 94 vs 32), but falls 33 points short on WikiTQ compositional reasoning — a calibrated benchmark for what 7B open models can and cannot do in finance AI today.
SWE-agent (NeurIPS 2024) introduces Agent-Computer Interfaces (ACIs) — purpose-built layers between LLMs and software environments — showing a 10.7-percentage-point improvement over raw shell access and 12.47% resolution on SWE-bench with GPT-4 Turbo. Interface design, not model capability, is the primary bottleneck for autonomous coding agents.