CodeAct: Why Executable Python Code Makes LLM Agents 20% More Accurate
After reading the "cannot self-correct" paper last week, a natural next question is: if LLMs can't reliably audit their own output, what action format gives agents the best chance of detecting and recovering from errors automatically? CodeAct, published by Xingyao Wang et al. and accepted at ICML 2024, argues the answer is Python code — not because code is magic, but because a Python interpreter provides exactly the kind of external, deterministic feedback that the self-correction literature shows LLMs desperately need.
The paper
"Executable Code Actions Elicit Better LLM Agents" (arXiv:2402.01030) by Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji proposes replacing the JSON and text action formats common in tool-calling agents with executable Python code. The core idea is that code is a better lingua franca for agent actions than either natural language instructions or structured JSON, because code already encodes control flow, data dependencies, and multi-step composition — and because LLMs have been pretrained heavily on it.
The paper makes three contributions: (1) a conceptual argument for code as a unified action space; (2) M3ToolEval, a new benchmark of 82 human-curated tasks requiring multi-tool composition; and (3) CodeActAgent, a fine-tuned 7B model trained on CodeActInstruct, a dataset of 7,139 multi-turn code-based trajectories spanning information retrieval, software packages, external memory, and robot planning.
Key ideas
- On M3ToolEval, GPT-4 with CodeAct achieves 74.4% success rate versus 53.7% with text actions — a roughly 20 percentage-point absolute improvement in the most demanding multi-tool setting.
- CodeAct requires about 30% fewer interaction turns than JSON-based agents on the same tasks. That matters: every extra round-trip is another opportunity for error propagation.
- The Python interpreter acts as an automatic, zero-cost error signal. A wrong intermediate calculation raises an exception immediately; the agent sees the traceback and can revise without a separate critique step.
- Open-source models benefit more than closed-source ones. CodeActAgent (Mistral 7B) reaches 12.2% on M3ToolEval versus 3.7% for the previously strongest open-source agent (Lemur-70B with text). The leverage is higher because Python is abundant in pretraining data; specialized JSON tool-calling formats are not.
- CodeActInstruct trains on four domains specifically chosen to stress-test composition: information seeking, package calls, external memory manipulation, and robot planning. These are all multi-step, state-dependent tasks — the exact failure modes where JSON agents break.
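The interpreter-as-error-signal idea from the bullets above is easy to sketch. This is an illustrative toy, not the paper's implementation: `run_action` and the two-turn exchange are invented here, with an `exec` call standing in for the agent's execution environment.

```python
import traceback

def run_action(code: str, namespace: dict) -> str:
    """Execute one code action; return the observation the agent sees."""
    try:
        exec(code, namespace)
        return "OK"
    except Exception:
        # The traceback IS the automatic error signal: no separate
        # critique step is needed to detect that the action failed.
        return traceback.format_exc()

# Hypothetical two-turn interaction: the first action has a bug, and
# the observation (a traceback) is what lets the agent revise it.
ns = {}
obs1 = run_action("total = sum(prices)", ns)  # NameError: prices undefined
obs2 = run_action("prices = [3, 4]\ntotal = sum(prices)", ns)
```

Note that state (`ns`) persists across turns, so later actions can build on earlier results — the multi-step composition that isolated JSON calls lack.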
What holds up — and what doesn't
The 20% improvement on M3ToolEval is real, but M3ToolEval has 82 tasks. That's a small sample, and the paper doesn't report confidence intervals. The benchmark is also curated by the same team that proposes the method, which is standard in the field but worth flagging. I'd want to see this replicated on a fully independent benchmark before treating 74.4% as a reliable figure.
The efficiency claim — 30% fewer turns — is plausible but conflates two things. Fewer turns could mean the agent is more accurate per step, or it could mean failures terminate earlier. The paper doesn't decompose this cleanly.
The acknowledged gap between open- and closed-source models is large, and CodeAct does not explain it away. CodeActAgent (Mistral 7B) at 12.2% is much better than Lemur-70B at 3.7%, but GPT-4 with CodeAct is at 74.4%. The format helps, but it doesn't close a 60-point capability gap. Anyone planning to deploy an open-source Beancount agent should read that number carefully.
The sandboxing question gets one paragraph. Arbitrary code execution in a finance context is not an inconvenient edge case — it's the primary security concern. The paper doesn't engage with what happens when the agent generates code that deletes files, makes network calls, or imports unexpected libraries. For a production accounting agent, the sandbox design is at least as important as the action format.
Why this matters for finance AI
The Beancount write-back problem is essentially the problem CodeAct is designed for: an agent needs to compose multiple operations (read current balance, validate transaction, write a new posting, verify the balance equation) in a specific order, with data flowing between steps. JSON tool-calling handles this poorly because each call is isolated. Python handles it naturally.
More concretely: a CodeAct-style Beancount agent could express an entire reconciliation workflow as a single Python script — querying the ledger via a library, computing deltas, proposing new entries, and running bean-check on the result — all before committing anything. The interpreter catches the obvious errors; the LLM only needs to handle the semantic ones. That's a better division of labor than asking the LLM to validate its own JSON.
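A toy sketch of that reconciliation logic, using an invented in-memory ledger rather than the real beancount API — the account names and the `balances`, `validate`, and `reconcile` helpers are all hypothetical, and the `validate` gate stands in for what bean-check would enforce on a real file:

```python
from decimal import Decimal

# Toy ledger: each transaction is a list of (account, amount) postings.

def balances(ledger):
    """Sum postings per account across the whole ledger."""
    out = {}
    for txn in ledger:
        for account, amount in txn:
            out[account] = out.get(account, Decimal(0)) + amount
    return out

def validate(txn):
    # Beancount's core invariant: postings in a transaction sum to zero.
    if sum(amount for _, amount in txn) != 0:
        raise ValueError("unbalanced transaction")

def reconcile(ledger, bank_total, asset_account, expense_account):
    # Each step feeds the next: read balance -> compute delta ->
    # propose entry -> validate -> write -> verify. Nothing is
    # committed until the final assertion passes.
    current = balances(ledger).get(asset_account, Decimal(0))
    delta = bank_total - current
    if delta == 0:
        return ledger
    proposed = [(asset_account, delta), (expense_account, -delta)]
    validate(proposed)  # the interpreter-level gate: raises on bad entries
    new_ledger = ledger + [proposed]
    assert balances(new_ledger)[asset_account] == bank_total
    return new_ledger
```

The point of the sketch is the data flow: `delta` computed in one step is consumed by the next, and a raised exception anywhere aborts the whole write — exactly what isolated JSON tool calls can't express.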
The safety concern cuts the other way. An agent with unrestricted Python execution over a financial ledger is a significant attack surface. The right design is almost certainly a heavily restricted sandbox — no filesystem writes outside a designated temp directory, no network access, no shell commands — combined with a mandatory bean-check gate before any file is touched. CodeAct gives you the action format; you still have to build the cage.
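One cheap layer of that cage is a static audit of the generated code before it ever reaches the interpreter. The blocklist and `audit` helper below are assumptions of mine, and an AST scan is emphatically not a security boundary on its own — process-level isolation (containers, no network namespace) still has to do the real work — but it catches the obvious cases early:

```python
import ast

# Illustrative blocklist; a real deployment would decide its own policy.
FORBIDDEN = {"os", "subprocess", "socket", "shutil", "requests"}

def audit(code: str) -> list[str]:
    """Return the forbidden top-level modules imported by agent code."""
    problems = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [(node.module or "").split(".")[0]]
        else:
            continue
        problems += [n for n in names if n in FORBIDDEN]
    return problems
```

A gate like this would run before execution, with the sandboxed interpreter and the mandatory bean-check pass behind it.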
What to read next
- OpenHands (formerly OpenDevin) — the production agent system built on CodeAct by the same research group; shows how the sandboxing and execution environment are actually implemented (arXiv:2407.16741)
- ToolBench / ToolLLM — benchmarks and training data for tool-calling agents using REST APIs rather than Python; a useful contrast to CodeAct's code-first approach (arXiv:2307.16789)
- SWE-bench — evaluates agents on real GitHub issues, which requires multi-step code execution and file editing; the closest existing benchmark to what a Beancount write-back agent would need to pass (arXiv:2310.06770)
