
ReAct: Synergizing Reasoning and Acting in Language Models

· 6 min read
Mike Thrift
Marketing Manager

ReAct (Yao et al., ICLR 2023) is the paper behind the reasoning-then-acting loop that most modern finance agents now use as a default scaffold. I've been putting it off because it feels like infrastructure — the kind of thing everyone already knows — but after spending time with autonomous ledger write-back, I wanted to understand the failure modes at the source, not from downstream folklore.

The paper


Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao propose a deceptively simple idea: instead of asking a language model to either reason (chain-of-thought) or act (invoke tools), let it do both in an interleaved stream. Each step in the trajectory is either a Thought (free-form reasoning about what to do next) or an Act/Obs pair (an action and its observation from the environment). The claim is that this interleaving is synergistic — reasoning shapes which actions to take, and observations reshape the reasoning.

They test this on four benchmarks: HotpotQA and Fever (knowledge-intensive QA and fact verification, using a Wikipedia search API as the action space), and ALFWorld and WebShop (embodied and simulated e-commerce environments requiring multi-step decision-making). All experiments use PaLM-540B and GPT-3 (text-davinci-002) under few-shot prompting with as few as one or two in-context examples.

Key ideas

  • On ALFWorld, ReAct outperforms imitation learning and reinforcement learning baselines by 34 percentage points absolute in task success rate; on WebShop, the gain is 10 percentage points absolute.
  • On Fever (fact verification), ReAct outperforms chain-of-thought. On HotpotQA (multi-hop QA), CoT actually beats ReAct — the paper acknowledges this directly rather than burying it.
  • The failure cases split into two types: reasoning errors (the model misjudges what information it has) and search errors (a non-informative Wikipedia result derails the subsequent reasoning chain). These are qualitatively different and require different mitigations.
  • The format itself is interpretable: a human can read the Thought trace, find the mistake, and correct it by editing a single line. This is explicitly called out as a safety property.
  • Fine-tuning smaller models on ReAct trajectories lets them outperform prompted larger models — suggesting the interleaved format is learnable, not just a prompting trick.

What holds up — and what doesn't

The interactive decision-making results (ALFWorld, WebShop) are the strongest part of the paper. The gap over pure imitation learning is large enough that it's hard to attribute to hyperparameter luck. The reasoning traces are genuinely readable, and the error analysis distinguishing search failures from reasoning failures is honest and useful.

The knowledge-intensive QA results are weaker and the paper knows it. ReAct losing to CoT on HotpotQA is a real data point: when the answer can be reached by chaining internal model knowledge, the friction of tool invocations actually hurts. The model sometimes retrieves a Wikipedia passage that is tangentially related, anchors on it, and then produces worse reasoning than if it had just stayed in its head. The paper calls this "search-induced distraction" and it is not fixed by the architecture — it is a retrieval quality problem dressed up as an agent problem.

There's also a fundamental evaluation issue that the paper inherits from the benchmarks themselves: both ALFWorld and WebShop have relatively constrained action spaces compared to what a real-world agent needs. The 34-percentage-point improvement on ALFWorld is impressive within the game, but ALFWorld is a simulated household environment with a small fixed vocabulary of actions. Generalizing from that to, say, a Beancount ledger with an open-ended transaction schema requires extrapolation the paper doesn't justify.

The few-shot setup is both a strength and a weakness. One or two in-context examples is impressive, but it also means the results are highly sensitive to which examples are chosen. I didn't find ablations over example selection in the paper, which would have been useful.

Why this matters for finance AI

The write-back safety problem for autonomous Beancount agents is exactly the failure regime ReAct illuminates. If an agent is reasoning through a transaction categorization decision and retrieves an ambiguous ledger entry — one that could map to either Expenses:Food or Expenses:Entertainment — the ReAct pattern will anchor the subsequent reasoning on whichever interpretation the first retrieved entry suggests. This is the finance analog of "search-induced distraction," and it does not go away by prompting more carefully.

The interpretability argument matters more here than the paper probably intended. In accounting, an auditor doesn't just need the right answer — they need a traceable chain of reasoning they can sign off on. ReAct's Thought traces give you that chain, and the observation that a human can correct a trajectory by editing one Thought is directly applicable to a human-in-the-loop review step before any journal entry commits to the ledger.
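One way to operationalize that review step is a hard gate: nothing writes to the ledger until a human has seen the Thought trace and approved (or corrected) it. A minimal sketch, assuming hypothetical `ProposedEntry`, `review`, and `commit` names — none of this is a real Beancount API:

```python
from dataclasses import dataclass, field

@dataclass
class ProposedEntry:
    narration: str
    postings: dict                          # account -> amount (illustrative)
    thoughts: list = field(default_factory=list)  # the agent's Thought trace
    approved: bool = False

def review(entry, approve, edited_thoughts=None):
    # The reviewer can correct the trace (e.g. fix a single Thought)
    # before approving — mirroring the paper's one-line-edit property.
    if edited_thoughts is not None:
        entry.thoughts = edited_thoughts
    entry.approved = approve
    return entry

def commit(entry, ledger):
    # Hard gate: unreviewed entries never reach the ledger.
    if not entry.approved:
        raise PermissionError("entry not reviewed")
    ledger.append(entry)
```

The design choice worth noting is that the trace travels with the entry: the auditor signs off on the reasoning chain, not just the final postings.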

The failure mode I care about most, though, is compounding errors on long-horizon tasks. A reconciliation job touching fifty transactions has many more opportunities for a Thought to go wrong than a single-hop Wikipedia lookup. ReAct provides no native mechanism for the agent to detect that it has drifted — it just keeps going. Reflexion (Shinn et al., arXiv:2303.11366) addresses this by adding a verbal self-evaluation step, and ReAct + Reflexion closes 130 of 134 ALFWorld tasks compared to ReAct alone. That delta tells you how much value there is in adding a recovery loop on top of the basic ReAct scaffold.
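The recovery loop itself is a thin wrapper around the base agent. A sketch of the Reflexion-style outer loop, assuming an `episode(memory)` callable that runs one full ReAct trajectory and returns `(success, trace)`, and a `reflect(trace)` callable that turns a failed trace into a verbal lesson — both names are illustrative, and the real method's self-evaluation is LLM-generated, not a fixed function:

```python
def reflexion(episode, reflect, max_trials=3):
    memory = []                         # verbal lessons carried across episodes
    for trial in range(max_trials):
        success, trace = episode(memory)
        if success:
            return True, trial, memory
        memory.append(reflect(trace))   # self-evaluation feeds the next attempt
    return False, max_trials, memory

# Toy episode: fails until at least one lesson is in memory.
def toy_episode(memory):
    if memory:
        return True, "categorized correctly"
    return False, "anchored on Entertainment, task failed"

def toy_reflect(trace):
    return "Lesson: check the vendor before categorizing."

ok, trials, lessons = reflexion(toy_episode, toy_reflect)
```

The drift-detection the base scaffold lacks lives entirely in `reflect`: the agent only gets a chance to notice it has gone wrong between episodes, not mid-trajectory.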

  • Reflexion: Language Agents with Verbal Reinforcement Learning (Shinn et al., 2023, arXiv:2303.11366) — adds a self-reflection step that lets a ReAct agent revise its strategy across episodes; the most direct extension for ledger agents that need to recover from mid-trajectory errors.
  • FireAct: Toward Language Agent Fine-tuning (Chen et al., 2023, arXiv:2310.05915) — fine-tunes models specifically on ReAct trajectories across multiple tools; relevant for training a Beancount-specific agent on real ledger tool calls.
  • Tree of Thoughts: Deliberate Problem Solving with Large Language Models (Yao et al., 2023, arXiv:2305.10601) — explores search over reasoning paths rather than committing to a single chain; matters for cases where the first ReAct trajectory is wrong and needs systematic backtracking.