OpenHands is an MIT-licensed, Docker-sandboxed agent platform whose CodeAct agent achieves 26% on SWE-Bench Lite — a sobering result that establishes what AI agents can reliably do today, and why the first productive finance deployments should be tightly scoped rather than autonomous.
ShieldAgent (ICML 2025) replaces LLM-based guardrails with probabilistic rule circuits built on Markov Logic Networks, achieving 90.4% accuracy on agent attacks with 64.7% fewer API calls — and showing why that matters for verifiable safety in financial AI systems.
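The Markov Logic Network idea behind ShieldAgent can be illustrated in miniature: safety rules become weighted clauses, and the probability that an action is unsafe comes from a log-linear model over which rules it violates. The rules, weights, predicates, and bias below are invented for illustration and are not taken from the paper.

```python
import math

# Hypothetical weighted safety rules: (weight, predicate over an action dict).
# A higher weight means stronger evidence for "unsafe" when the rule fires.
RULES = [
    (2.0, lambda a: a["writes_ledger"] and not a["human_approved"]),
    (1.5, lambda a: a["amount"] > 10_000),
    (0.5, lambda a: a["tool"] not in {"query", "report"}),
]

def unsafe_probability(action: dict) -> float:
    """MLN-style log-linear score: sigmoid of the summed weights of the
    rules this action violates (2.0 is a hypothetical bias term)."""
    score = sum(w for w, rule in RULES if rule(action))
    return 1.0 / (1.0 + math.exp(-(score - 2.0)))

benign = {"writes_ledger": False, "human_approved": False,
          "amount": 50, "tool": "query"}
risky = {"writes_ledger": True, "human_approved": False,
         "amount": 50_000, "tool": "post_transaction"}

print(unsafe_probability(benign) < 0.5)  # no rules fire: low probability
print(unsafe_probability(risky) > 0.5)   # all three rules fire
```

The point of the circuit formulation is that this score is a deterministic, auditable function of explicit rules — unlike an LLM judge, the same action always gets the same verdict for the same reasons.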
Empirical comparison of RAG vs. unsupervised fine-tuning across 7B-parameter LLMs shows RAG achieves 0.875+ accuracy on post-cutoff facts while fine-tuning plateaus at 0.504 — with direct implications for Beancount agent design and any system requiring frequent knowledge updates.
Gorilla (Patil et al., NeurIPS 2024) fine-tunes a 7B LLaMA model with Retriever-Aware Training on retrieved API documentation, cutting hallucination rates from 78% to 11% versus GPT-4 zero-shot — with direct implications for finance AI write-back agents where wrong account names or inverted signs are correctness failures, not annoyances.
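The core mechanic of Retriever-Aware Training — grounding generation in a retrieved document rather than parametric memory — can be sketched with a simple prompt builder. The template wording and the Beancount doc snippet here are illustrative stand-ins, not Gorilla's actual training format.

```python
def build_rat_prompt(instruction: str, retrieved_doc: str) -> str:
    """Retriever-aware prompt: the retrieved API documentation is appended
    so the model learns to trust retrieval over its own memory.
    (Template is an illustrative stand-in for Gorilla's format.)"""
    return (
        f"{instruction}\n"
        f"Use this API documentation for reference: {retrieved_doc}"
    )

# Hypothetical retrieved snippet for a Beancount write-back agent.
doc = "beancount.loader.load_file(filename) -> (entries, errors, options)"
prompt = build_rat_prompt("Load a Beancount ledger from disk.", doc)
print(prompt)
```

Training on prompts of this shape is what lets the model defer to current documentation at inference time — the property that matters when account names and signs must be exactly right.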
SWE-agent (NeurIPS 2024) introduces Agent-Computer Interfaces (ACIs) — purpose-built layers between LLMs and software environments — showing a 10.7-percentage-point improvement over raw shell access and 12.47% resolution on SWE-bench with GPT-4 Turbo. Interface design, not model capability, is the primary bottleneck for autonomous coding agents.
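An ACI in SWE-agent's sense replaces raw shell output with compact, LLM-friendly views. A minimal sketch of one such command — a windowed file viewer with line numbers — assuming nothing about SWE-agent's actual implementation:

```python
def open_window(text: str, center: int, window: int = 5) -> str:
    """Show `window` numbered lines around `center` (1-indexed),
    mimicking an ACI file-viewing command that avoids dumping whole
    files into the LLM's context."""
    lines = text.splitlines()
    lo = max(0, center - 1 - window // 2)
    hi = min(len(lines), lo + window)
    header = f"[showing lines {lo + 1}-{hi} of {len(lines)}]"
    body = "\n".join(f"{i + 1}: {lines[i]}" for i in range(lo, hi))
    return f"{header}\n{body}"

sample = "\n".join(f"line {n}" for n in range(1, 21))
print(open_window(sample, center=10))
```

The design choice this illustrates: every byte shown to the model is budgeted and labeled, so the agent can navigate precisely instead of scrolling through raw `cat` output.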
SWE-bench evaluates language models on 2,294 real GitHub issues across 12 Python repositories using execution-based tests; at publication, Claude 2 resolved only 1.96% of issues with realistic retrieval, establishing the de facto benchmark for coding agents and revealing retrieval and patch-length failure modes directly relevant to Beancount write-back agents.
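Execution-based evaluation in SWE-bench's style reduces to: apply the model's patch, run the issue's tests, and count a resolution only if the previously failing tests now pass and nothing regresses. A schematic verdict function — the harness shape is illustrative, not SWE-bench's actual code:

```python
def resolved(fail_to_pass: dict, pass_to_pass: dict) -> bool:
    """SWE-bench-style verdict: every test that failed before the patch
    must now pass, and no previously passing test may regress.
    Each dict maps test name -> post-patch result (True = passed)."""
    return all(fail_to_pass.values()) and all(pass_to_pass.values())

print(resolved({"test_issue": True}, {"test_old": True}))   # resolved
print(resolved({"test_issue": False}, {"test_old": True}))  # not resolved
```

Because the verdict is a test run rather than a text match, partial or superficially plausible patches score zero — which is exactly why the headline numbers are so low.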
A close reading of Toolformer (Meta AI, NeurIPS 2023): how perplexity-filtered self-supervised training teaches a 6.7B-parameter model to call external APIs, where it outperforms GPT-3 175B on arithmetic benchmarks, and why its single-step architecture cannot support the chained tool calls required for structured ledger operations.
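Toolformer's perplexity filter keeps a candidate API call only if inserting the call and its result lowers the LM's loss on the following tokens by more than a threshold. A toy version of that criterion, with the losses given as numbers rather than computed by an LM (the real filter uses the model's own weighted cross-entropy):

```python
def keep_api_call(loss_without: float, loss_with_call: float,
                  tau: float = 1.0) -> bool:
    """Simplified Toolformer filter: retain the annotated call only when
    it reduces the LM loss on subsequent tokens by at least `tau`.
    Both losses would come from the LM; here they are plain inputs."""
    return loss_without - loss_with_call >= tau

# Call + result makes the continuation much easier to predict: keep it.
print(keep_api_call(loss_without=5.2, loss_with_call=2.1))  # True
# Marginal improvement below the threshold: discard the annotation.
print(keep_api_call(loss_without=3.0, loss_with_call=2.5))  # False
```

Since each candidate call is scored independently against its local continuation, the filter never learns that one call's result should feed the next — the single-step limitation the reading highlights.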