
Reflexion: Language Agents That Learn from Mistakes Without Retraining

Mike Thrift
Marketing Manager

I've been thinking about what it would take to build a Beancount ledger agent that gets better over time without being retrained every time it makes a mistake. Shinn et al.'s "Reflexion: Language Agents with Verbal Reinforcement Learning" (NeurIPS 2023) directly addresses that question, and the answer is both promising and more constrained than the headline numbers suggest.

The paper


Reflexion (Shinn, Cassano, Berman, Gopinath, Narasimhan, Yao; NeurIPS 2023) proposes that instead of adjusting model weights through expensive reinforcement learning, you can improve an agent by having it write its own failure analysis in natural language. After each attempt, a Self-Reflection model reads the trajectory and reward signal, produces a verbal post-mortem, and appends it to an episodic memory buffer. On the next trial, the Actor reads the accumulated reflections before acting. No gradient is computed. No model is fine-tuned. The "learning" lives entirely in the context window.
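The loop is simple enough to sketch directly. In the sketch below, `act`, `evaluate`, and `reflect` are hypothetical stand-ins for the Actor, Evaluator, and Self-Reflection model calls; the function names and signatures are mine, not the paper's:

```python
def reflexion_loop(task, act, evaluate, reflect, max_trials=3):
    """Minimal Reflexion loop: no gradients, all learning lives in `memory`."""
    memory = []  # episodic buffer of verbal reflections
    for trial in range(max_trials):
        trajectory = act(task, memory)         # Actor reads past reflections
        reward, passed = evaluate(trajectory)  # Evaluator: tests, judge, or heuristic
        if passed:
            return trajectory, memory
        # Self-Reflection turns the failed trajectory into a verbal post-mortem
        memory.append(reflect(task, trajectory, reward))
    return None, memory
```

Everything the agent "learns" is the growing `memory` list, which the Actor reads at the start of the next trial.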

The three-component architecture — Actor, Evaluator, Self-Reflection — is modular enough to accommodate different task types. The Evaluator can be a binary unit-test result, an LLM judge, or a task-specific heuristic. This flexibility is what makes the paper interesting beyond coding benchmarks.

Key ideas

  • On HumanEval Python pass@1, Reflexion + GPT-4 reaches 91%, up from the 80% GPT-4 baseline — a genuine and large gap. On LeetCode Hard the jump is 7.5% → 15%, which is progress but also a reminder of how hard those problems remain.
  • On AlfWorld (text-based household planning), Reflexion solves 130/134 tasks after 12 trials vs. 108/134 for the ReAct baseline — the most compelling decision-making result in the paper.
  • On HotpotQA multi-hop QA, CoT + Reflexion goes from 61% to 75% exact match on 100 sampled questions.
  • On MBPP (a second Python benchmark), Reflexion slightly hurts performance: 80.1% → 77.1%. The paper buries this.
  • On WebShop, Reflexion fails to help. The authors attribute this to the task requiring "significant diversity and exploration" — the agent writes unhelpful reflections that don't generalize across product searches.
  • Memory is capped at 1–3 stored experiences. This is pragmatic given context length but means the agent cannot accumulate learning across long deployment.
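That 1–3 experience cap maps naturally onto a bounded buffer. A minimal sketch — the class name and interface are my own, not the paper's:

```python
from collections import deque

class EpisodicMemory:
    """Keeps only the most recent reflections, mirroring the paper's 1-3 cap."""
    def __init__(self, max_experiences=3):
        self.buffer = deque(maxlen=max_experiences)

    def add(self, reflection: str):
        self.buffer.append(reflection)  # oldest reflection is silently evicted

    def as_prompt(self) -> str:
        """Render stored reflections for injection into the Actor's context."""
        return "\n".join(f"- {r}" for r in self.buffer)
```

The silent eviction is exactly the limitation discussed below: anything learned more than three failures ago is gone.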

What holds up — and what doesn't

The core claim is sound: verbal reflection improves performance on tasks with clear, verifiable feedback. If you know whether the code passed its unit tests, the reflection module has something concrete to reason about. The AlfWorld and HumanEval results are real and meaningful.

But the WebShop failure is instructive and the paper somewhat underplays it. Reflection works when the evaluator can produce a crisp, actionable signal. When the failure mode is "the agent explored the wrong part of a large search space," telling it to "try different search terms next time" does not converge. This is a structural limitation: verbal reinforcement is not a substitute for exploration strategies.

The coding experiments also have a circularity the authors acknowledge in their blog post: the agent generates its own unit tests to evaluate its own code. A flawed test suite produces false positives. The HumanEval 91% number holds because HumanEval provides ground-truth tests, but the agent's self-evaluation loop is less trustworthy on novel problems where no external oracle exists.
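The false-positive risk is easy to see in miniature: an incomplete self-generated suite can pass code that the ground-truth suite rejects. A toy illustration (not from the paper):

```python
def candidate_abs(x):
    # Buggy "absolute value" the agent produced: wrong for negative inputs
    return x

# Tests the agent happened to generate for itself: only non-negative inputs
self_tests = [(3, 3), (0, 0)]

# Ground-truth tests an external oracle (e.g. HumanEval) would supply
oracle_tests = [(3, 3), (0, 0), (-4, 4)]

passes_self = all(candidate_abs(i) == o for i, o in self_tests)
passes_oracle = all(candidate_abs(i) == o for i, o in oracle_tests)
# passes_self is True while passes_oracle is False: a false positive,
# so the reflection loop would see "success" and stop improving.
```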

Reproducibility is a real concern. All main results use GPT-4, and the starchat-beta experiments show no improvement over baseline, meaning the technique is capability-gated. Teams running smaller or open-weight models should not expect the same gains.

Why this matters for finance AI

The Beancount use case has exactly the property that makes Reflexion work well: a clear evaluator. If an agent incorrectly categorizes a transaction, the ledger balance check or a reconciliation step can produce a binary signal — the books balance or they don't. That is a much better feedback surface than WebShop's ambiguous product-search reward.
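A sketch of what that binary evaluator could look like, written against plain (account, amount, currency) tuples rather than the real Beancount API so it stays self-contained; the function name and tolerance are my own choices:

```python
from decimal import Decimal

def books_balance(postings, tolerance=Decimal("0.005")):
    """Binary evaluator: per-currency posting amounts must sum to ~zero."""
    totals = {}
    for account, amount, currency in postings:
        totals[currency] = totals.get(currency, Decimal(0)) + Decimal(amount)
    return all(abs(total) <= tolerance for total in totals.values())
```

The point is the shape of the signal, not the implementation: pass/fail, cheap to compute, and unambiguous — exactly what the Self-Reflection step needs to reason about.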

Concretely, I can imagine a Beancount write-back agent that, after a failed posting attempt (invalid account, wrong currency, assertion failure), generates a verbal reflection: "I used Expenses:Meals but this account requires a sub-category. Next time I will check the account hierarchy before posting." This reflection is stored and retrieved on the next similar transaction. The agent effectively accumulates a session-specific policy from its own errors.
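A hypothetical sketch of that store-and-recall step, keyed by a coarse error signature; all names here are mine, not a real Beancount integration:

```python
class ReflectionStore:
    """Stores verbal reflections keyed by (error type, account),
    so the next similar failure retrieves the relevant lesson."""
    def __init__(self):
        self._by_error = {}

    def record(self, error_type: str, account: str, reflection: str):
        self._by_error[(error_type, account)] = reflection

    def recall(self, error_type: str, account: str):
        return self._by_error.get((error_type, account))

store = ReflectionStore()
store.record(
    "invalid-account", "Expenses:Meals",
    "Expenses:Meals requires a sub-category; check the hierarchy before posting.",
)
```

On the next transaction that trips the same error signature, `recall` surfaces the stored lesson into the agent's prompt.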

The memory cap is the main architectural challenge. A 1–3 experience buffer is fine for a single session, but a deployed accounting agent needs to learn across thousands of transactions and weeks of operation. Extending Reflexion to long-horizon memory — perhaps by summarizing or indexing reflections — is an open problem. The paper does not solve it.
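One naive direction: keep every reflection, but retrieve only the few most relevant to the transaction at hand instead of carrying all of them in context. A keyword-overlap sketch, as a stand-in for a real embedding index:

```python
def retrieve(reflections, query, k=3):
    """Rank stored reflections by word overlap with the current context."""
    query_words = set(query.lower().split())
    scored = sorted(
        reflections,
        key=lambda r: len(query_words & set(r.lower().split())),
        reverse=True,
    )
    return scored[:k]
```

With retrieval in front of the buffer, the hard cap becomes a cap on the prompt, not on the agent's accumulated experience — though whether verbal reflections stay useful at that scale is exactly the open question.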

Further reading

  • Language Agent Tree Search (LATS) (Zhou et al., arXiv:2310.04406; ICML 2024) — extends Reflexion by wrapping Monte Carlo Tree Search around the reflection-retry loop, letting agents explore multiple reasoning branches rather than committing to one trajectory. Achieves 92.7% on HumanEval with GPT-4.
  • Retroformer (Yao et al., arXiv:2308.02151; ICLR 2024) — rather than relying on the same LLM to self-reflect, Retroformer trains a separate lightweight retrospective model via policy gradient, making the reflection process learnable across tasks. More principled, but requires fine-tuning.
  • Self-Reflection in LLM Agents: Effects on Problem-Solving Performance (arXiv:2405.06682, 2024) — an empirical study specifically probing when and why reflection helps, with ablations across task types. Useful for calibrating when to apply Reflexion vs. other correction strategies.