SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

· 6 min read
Mike Thrift
Marketing Manager

The CodeAct paper made a compelling case for Python as the right action format for LLM agents. But choosing the right action format is only half the problem — you also have to demonstrate that agents can handle real-world task complexity, not just curated benchmarks. SWE-bench (arXiv:2310.06770), published by Carlos Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan from Princeton and presented at ICLR 2024, is the paper that forced the field to confront that gap directly.

The paper

"SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" builds a benchmark of 2,294 task instances drawn from actual merged pull requests across 12 popular Python repositories — astropy, django, flask, matplotlib, pylint, pytest, requests, scikit-learn, seaborn, sphinx, sympy, and xarray. Each instance presents the model with a codebase snapshot and a GitHub issue description; the model must produce a patch that makes a designated set of previously-failing tests pass without breaking existing tests. The benchmark was constructed by mining ~90,000 PRs, filtering for those that both resolved a linked issue and added new tests, and then verifying execution-based pass/fail transitions. This disciplined construction avoids the typical benchmark problem of ambiguous or easily-gameable ground truth.
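
The attribute-filtering stage can be sketched as follows. This is a simplified illustration, not the paper's actual code: `pr` is a hypothetical record of a merged pull request, and the real pipeline adds execution-based checks (installing the repo, running the new tests before and after the gold patch) on top of these filters.

```python
# Sketch of SWE-bench-style candidate filtering (simplified): keep merged
# PRs that resolve a linked issue and also add or modify tests. The real
# pipeline then verifies fail-to-pass transitions by executing the tests.

def is_candidate(pr: dict) -> bool:
    resolves_issue = bool(pr["linked_issues"])       # PR closes an issue
    contributes_tests = any("test" in path.lower()   # PR touches test files
                            for path in pr["changed_files"])
    return pr["merged"] and resolves_issue and contributes_tests

# Hypothetical PR record for illustration.
pr = {
    "merged": True,
    "linked_issues": [2317],
    "changed_files": ["requests/models.py", "tests/test_models.py"],
}
print(is_candidate(pr))  # True
```

Only a few percent of the ~90,000 mined PRs survive this funnel, which is exactly what gives the remaining 2,294 instances their unambiguous ground truth.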

Key ideas

  • Claude 2, the best-performing model at publication, resolves only 1.96% of issues using BM25 sparse retrieval — the realistic deployment setting where the model must find relevant files on its own.
  • With oracle retrieval — where the model is handed exactly the files it needs — Claude 2 improves to 4.80%, confirming that the bottleneck is partly retrieval but mainly editing: even with perfect context, success rates stay under 5%.
  • GPT-4 resolves 0.00% of issues with BM25 retrieval (evaluated on a 25% subset for budget reasons) and 1.74% with oracle. The oracle-to-BM25 drop for Claude 2 is severe; for GPT-4 it's total.
  • Generated patches are systematically too short: Claude 2's successful patches average 19.6 lines, versus 74.5 lines for the gold human patches. Models find simple local fixes; humans write comprehensive cross-file solutions.
  • More context actively hurts. BM25 at 50k tokens retrieves more of the oracle files than at 13k tokens, yet resolution rates often decline. The "lost in the middle" effect — models ignoring relevant evidence buried in long contexts — is a real and documented failure mode here.
  • SWE-Llama 13b, fine-tuned on oracle-retrieved contexts, achieves 4.00% with oracle but only 0.70% with BM25. Training on perfect context creates brittle agents that collapse when retrieval is realistic.

What holds up — and what doesn't

The benchmark construction is rigorous. Execution-based evaluation — tests actually run, pass or fail — is the right ground truth for code-editing tasks. It's objective, automatable, and hard to game. The decision to require fail-to-pass transitions (not just successful patch application) is particularly important: it prevents degenerate patches, such as no-ops or deletions, from counting as solutions.
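
The evaluation criterion can be stated in a few lines. This is a minimal sketch, not the official harness: `run_tests` stands in for a hypothetical execution backend that applies a patch and returns per-test results, and the field names mirror the benchmark's fail-to-pass / pass-to-pass distinction.

```python
# Sketch of execution-based evaluation: a patch resolves an instance only
# if every designated fail-to-pass test passes after the patch AND no
# previously-passing test regresses.

def resolves(instance: dict, run_tests) -> bool:
    after = run_tests(instance["model_patch"])
    fixed = all(after.get(t, False) for t in instance["fail_to_pass"])
    intact = all(after.get(t, False) for t in instance["pass_to_pass"])
    return fixed and intact

# Toy harness: pretend any non-empty patch fixes the failing test.
def fake_harness(patch):
    return {"test_fix": patch is not None, "test_existing": True}

instance = {"model_patch": "diff ...",
            "fail_to_pass": ["test_fix"],
            "pass_to_pass": ["test_existing"]}
print(resolves(instance, fake_harness))  # True
```

Note that both conditions are conjunctive: a patch that fixes the target behavior while breaking an unrelated test scores zero, which is what makes the metric hard to game.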

The results have held up remarkably well. SWE-bench was published in October 2023 and rapidly became the de facto evaluation for coding agents. The initial 1.96% baseline is genuinely informative, not cherry-picked. SWE-agent, published in 2024 by an overlapping set of authors, moved the needle to 12.47% by redesigning the agent-computer interface — a 6x improvement that itself confirms how much the original baseline left on the table.

Two things the paper doesn't handle well: First, the benchmark is Python-only. That's a practical necessity, but it creates real risk of overfitting to Python conventions. Second, the paper proposes only retrieval-augmented generation baselines and explicitly defers to future work for agent-based approaches. That deferral was appropriate in 2023 but means the paper itself provides no signal on what agent architectures help.

The "oracle" setting is also a weaker upper bound than it looks. Providing perfect file context doesn't solve code localization within those files, and it doesn't help with multi-file reasoning about interactions between modules. Claude 2 at 4.8% oracle means that even with the right files in context, the model fails 95% of the time. The problem isn't primarily retrieval.

Why this matters for finance AI

Beancount is a Python project hosted on GitHub. A write-back agent for Beancount is, in essence, an agent that needs to pass SWE-bench-style tasks: given a ledger file and an instruction ("add this transaction," "fix this balance discrepancy"), produce an edit that passes bean-check without breaking existing assertions.

The retrieval failure is directly analogous to the ledger localization problem. When a user says "fix the overstatement in Q3 office supplies," the agent must first find the relevant entries in a file that might contain thousands of lines — the same file localization task where BM25 fails for 40–50% of SWE-bench instances. The "lost in the middle" degradation applies equally to long .beancount files, where earlier dated entries are just as likely to be ignored.
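
To make the localization step concrete, here is a toy version of sparse lexical retrieval over individual ledger entries, using a hand-rolled BM25-style score. This is illustrative only — the entries are invented, and a production agent would use a tested retrieval library rather than this sketch.

```python
# Toy BM25-style ranking of ledger entries against a user query.
import math
import re
from collections import Counter

def tokenize(text: str):
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_rank(query: str, docs, k1=1.5, b=0.75):
    toks = [tokenize(d) for d in docs]
    avgdl = sum(len(t) for t in toks) / len(toks)
    df = Counter(w for t in toks for w in set(t))  # document frequency
    n = len(docs)

    def score(tok):
        tf = Counter(tok)
        total = 0.0
        for w in tokenize(query):
            if w not in tf:
                continue
            idf = math.log(1 + (n - df[w] + 0.5) / (df[w] + 0.5))
            norm = tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * len(tok) / avgdl))
            total += idf * norm
        return total

    return sorted(range(n), key=lambda i: -score(toks[i]))  # best first

entries = [
    '2024-07-10 * "Cafe"    Expenses:Meals            18.50 USD',
    '2024-08-02 * "Metro"   Expenses:Transport         2.75 USD',
    '2024-09-12 * "Staples" Expenses:Office-Supplies 990.00 USD',
]
print(bm25_rank("office supplies", entries)[0])  # 2
```

The failure mode is visible even in the toy: a query phrased as "overstatement in Q3" shares no tokens with any entry, so lexical retrieval returns noise — the same mismatch between issue language and code language that sinks BM25 on SWE-bench.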

The patch length asymmetry is a practical warning. Models patch too narrowly. In accounting that translates to fixing one entry while missing the offsetting entry, or updating the expense line while leaving the running balance stale. A production Beancount agent needs a validation layer — bean-check, balance assertions, or an explicit reconciliation pass — that forces the agent to see the full consequence of its edit, not just its local plausibility.
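
One way to force that full-consequence view: after every edit, re-derive account totals from the postings and compare them against recorded balance assertions, so a narrow fix that leaves the offsetting side stale is caught immediately. The names and data shapes below are illustrative, not Beancount's API.

```python
# Sketch of a reconciliation pass: re-sum accounts after an edit and
# report any account whose total violates its balance assertion.
from collections import defaultdict

def account_totals(postings):
    """postings: list of (account, amount) pairs."""
    totals = defaultdict(float)
    for account, amount in postings:
        totals[account] += amount
    return dict(totals)

def violates_assertions(postings, assertions):
    """assertions: {account: expected_balance}; returns offending accounts."""
    totals = account_totals(postings)
    return [a for a, expected in assertions.items()
            if abs(totals.get(a, 0.0) - expected) > 1e-9]

# A narrow edit: the expense line was corrected to 240.00, but the
# offsetting cash posting was left at the old, overstated amount.
postings = [("Expenses:Office-Supplies", 240.00), ("Assets:Cash", -990.00)]
assertions = {"Assets:Cash": -240.00}
print(violates_assertions(postings, assertions))  # ['Assets:Cash']
```

This is the accounting analogue of SWE-bench's pass-to-pass check: the edit is judged by the whole ledger's invariants, not by the plausibility of the changed lines.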

The oracle/BM25 gap is also a reminder that retrieval quality is not separable from agent quality. An agent that can't reliably identify which accounts or files are relevant to a user's question will fail at the ledger navigation step before it even attempts an edit.

Related reading

  • SWE-agent (arXiv:2405.15793, NeurIPS 2024) — the direct follow-up; moves from 1.96% to 12.47% by redesigning the agent-computer interface. The design principles for file navigation and code search are directly applicable to Beancount agent tooling.
  • Agentless: Demystifying LLM-based Software Engineering Agents (arXiv:2407.01489) — strips away agent complexity and shows that a simple localization + repair pipeline without scaffolding can be competitive; a useful counterpoint to interface-heavy approaches.
  • MemGPT: Towards LLMs as Operating Systems (arXiv:2310.08560) — addresses the long-context problem head-on with tiered memory management; directly relevant to agents that must reason over multi-year Beancount ledgers.