
Tree of Thoughts: Deliberate Problem Solving with LLM Search

· 7 min read
Mike Thrift
Marketing Manager

After spending the last two entries on agents that self-correct via reflection (Reflexion) and tool-interactive critiquing (CRITIC), I wanted to step back and look at a more structural approach: what if the agent never commits to a single reasoning path in the first place? Tree of Thoughts (ToT) from Yao et al. (NeurIPS 2023) proposes exactly that — a search framework where the LLM explores a branching space of intermediate reasoning steps rather than one linear chain. I read it now because it represents the clearest formulation of deliberate search for LLM reasoning, and deliberate search is what you need when a single wrong intermediate step in a financial calculation can silently corrupt everything downstream.

The paper


Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan introduce Tree of Thoughts as a generalization of chain-of-thought prompting. The key move is to treat intermediate reasoning steps as "thoughts" — coherent text units that can be independently evaluated — and to organize them into a tree rather than a chain. At each node, the model generates multiple candidate thoughts, evaluates each one (via a separate LLM call that scores states as "sure / maybe / impossible"), and then applies a standard search algorithm (BFS or DFS) to traverse the tree. If a branch looks dead, the model can prune it or backtrack — something neither CoT nor CoT-SC can do.
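The generate–evaluate–prune loop can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `propose_thoughts` and `evaluate_state` are hypothetical stand-ins for the two LLM calls, and the toy run at the bottom searches for three numbers from {1, 2, 3} that sum to exactly 6.

```python
VALUES = {"sure": 1.0, "maybe": 0.5, "impossible": 0.0}

def tot_bfs(root, propose_thoughts, evaluate_state, beam_width=5, depth=3):
    """Breadth-first search over partial reasoning states.

    propose_thoughts(state) -> list of candidate next thoughts
    evaluate_state(state)   -> "sure" | "maybe" | "impossible"
    """
    frontier = [root]
    for _ in range(depth):
        # Expand every frontier state with every proposed thought.
        candidates = [state + [t] for state in frontier
                      for t in propose_thoughts(state)]
        # Keep the best-scoring states; "impossible" branches are pruned.
        scored = sorted(candidates,
                        key=lambda s: VALUES[evaluate_state(s)],
                        reverse=True)
        frontier = [s for s in scored
                    if evaluate_state(s) != "impossible"][:beam_width]
        if not frontier:
            return None  # every branch pruned -- nothing left to extend
    return frontier[0]

# Toy run: pick three numbers from {1, 2, 3} whose sum is exactly 6.
def propose(state):
    return [1, 2, 3]

def evaluate(state):
    total = sum(state)
    if total > 6:
        return "impossible"
    return "sure" if total == 6 else "maybe"

result = tot_bfs([], propose, evaluate, beam_width=5, depth=3)
```

With a real LLM behind the two callbacks, `state` would be the running chain of partial reasoning text and the evaluator prompt would ask for the same sure/maybe/impossible verdict the paper uses.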

The paper evaluates on three tasks: Game of 24 (combine four numbers with arithmetic operations to reach 24), Creative Writing (write a coherent four-paragraph passage in which each paragraph ends with one of four given random sentences), and Mini Crosswords (solve 5×5 crossword puzzles). All three require reasoning that benefits from exploration and backtracking, which is exactly the setting the authors designed for.

Key ideas

  • On Game of 24, ToT with beam width b=5 achieves 74% success, versus 4% for GPT-4 with standard CoT and 9% for CoT-SC with 100 samples. That gap is striking.
  • GPT-3.5 + ToT reaches only 19% on the same task; the method's benefit is highly model-dependent. The thought generation quality of GPT-4 is what drives most of the gain — GPT-4 generation + GPT-3.5 evaluation achieves 64%, while GPT-3.5 generation + GPT-4 evaluation achieves only 31%.
  • For Creative Writing, ToT scores 7.56 vs CoT's 6.93 on a GPT-4 coherency scale, and human annotators prefer ToT outputs 41/100 times versus 21/100 for CoT.
  • Mini Crosswords: ToT achieves 60% word-level accuracy (versus 15.6% for CoT and 14% for IO prompting) but solves only 4 of 20 full games (20%). The gap between word-level and game-level success reveals that even with backtracking, global constraint satisfaction remains hard.
  • The evaluation step is itself an LLM call. On crosswords, the paper notes that evaluators sometimes deem correct partial states "impossible" due to unfamiliar vocabulary — a compounding failure mode where the evaluator's mistakes poison the search.
  • Compute cost: ToT costs approximately $0.74 per case on Game of 24 versus $0.47 for best-of-100 CoT. The authors themselves flag that for tasks GPT-4 already handles well, the overhead is not worth it.

What holds up — and what doesn't

The core result — that tree search over intermediate thoughts massively outperforms sequential CoT on tasks requiring backtracking — is real and reproducible. The 74% vs 4% gap on Game of 24 is not noise. The explanation is mechanically sound: a single bad intermediate equation in CoT sends the rest of the chain off a cliff, while ToT can prune that branch and try a different decomposition.

What I find less convincing is the generalizability claim. All three evaluation tasks are relatively synthetic: a math puzzle, a creative writing prompt with structural constraints, and a word game. None of them resemble the open-ended, ambiguous problems that appear in production finance workflows. The authors also evaluate only on GPT-4 (and GPT-3.5 as an ablation), so we don't know how ToT performs with smaller or fine-tuned models — and the 19% figure for GPT-3.5 suggests the answer is "not well."

The game-level crossword failure (20% despite 60% word accuracy) points to a deeper issue: ToT is a local search guided by a local evaluator. It does not maintain a global constraint model, which is exactly what you need for problems where subsolution interactions are dense. The follow-on Graph of Thoughts paper (Besta et al., AAAI 2024) makes this criticism explicit and demonstrates a 62% quality improvement over ToT on sorting tasks while reducing cost by over 31% — by allowing thoughts to merge and form cycles rather than being confined to a tree.

Finally, the cost structure matters in practice. At b=5 with repeated evaluator calls, ToT is roughly 15–20× more expensive in API calls than a single CoT pass. For latency-sensitive or cost-sensitive applications, this is not trivially acceptable.

Why this matters for finance AI

The honest answer is: ToT matters most for a narrow slice of the Beancount problem space, but that slice is real.

The canonical finance task where I want backtracking is multi-step account classification with ambiguous transactions. When an LLM is mapping an imported bank statement to a chart of accounts, a single wrong assignment early in the chain (say, treating a loan disbursement as income) can cascade into a broken balance check several steps later. In a CoT agent, by the time the balance fails, the model has no mechanism to revisit the original classification. A ToT agent could backtrack to that node and try Liabilities:Loans instead.
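That backtracking move can be made concrete in a short sketch. Every name here is illustrative, not a real API: `classify_with_backtracking`, the candidate account list, and the stubbed `balance_ok` check all stand in for what a real agent would wire to its ledger.

```python
def classify_with_backtracking(amount, candidates, balance_ok):
    """Try candidate accounts in confidence order; return the first one
    that keeps the downstream balance check satisfied, instead of
    committing to the top-ranked guess the way a linear CoT agent would."""
    for account in candidates:
        if balance_ok(account, amount):
            return account
    return None  # nothing balances: surface for review, don't write back

# Ambiguous loan disbursement: the model's top guess is income.
candidates = ["Income:Misc", "Liabilities:Loans"]

# Stubbed check: booking the inflow as income breaks a later balance
# assertion, while booking it against the loan account balances.
def balance_ok(account, amount):
    return account == "Liabilities:Loans"

chosen = classify_with_backtracking(10_000, candidates, balance_ok)
```

The point of the sketch is the control flow: the wrong early assignment is revisited at the node where it was made, rather than detected (too late) when the chain's final balance check fails.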

Similarly, tax optimization across a full fiscal year is a genuine tree search problem: itemizing vs. taking the standard deduction, timing capital gains realizations, bunching charitable contributions. These decisions interact non-linearly, and you need to evaluate multiple branches before committing. ToT's BFS/DFS framing maps naturally onto that structure.

What ToT does not help with is the dominant case in Beancount: routine transaction entry and reconciliation. For a transaction that has a clear counterpart in the ledger, CoT + PAL (offload arithmetic to a code interpreter) is faster, cheaper, and already accurate enough. Bringing ToT to bear on an Expenses:Groceries classification is using a sledgehammer on a pushpin.
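The PAL pattern is trivial to sketch: the model emits a short program and the interpreter does the arithmetic. The `generated` string below is a hand-written stand-in for model output, not a real model call.

```python
# Hand-written stand-in for model-emitted code; in PAL the LLM writes this.
generated = """
amounts = [12.37, 48.10, 3.99]
total = round(sum(amounts), 2)
"""

namespace = {}
exec(generated, namespace)  # arithmetic done by the interpreter, not the LLM
total = namespace["total"]
```

The division of labor is the whole idea: the model handles the fuzzy mapping from statement text to amounts, and the deterministic interpreter handles the part where a single-digit slip would corrupt the ledger.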

The more pressing concern for write-back safety is the evaluator reliability problem. If the state evaluator is also an LLM, it can be wrong — and wrong evaluations don't just slow the search, they prune correct paths. Any production finance agent using ToT would need an external oracle (a balance check, a schema validator, a rule engine) to serve as the evaluator, not another LLM call.
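As a sketch of what such an external oracle could look like: a deterministic double-entry balance check that maps candidate postings straight to the paper's sure/impossible labels, with no LLM judgment in the loop. The posting format (account, amount) is an assumption for illustration, not Beancount's API.

```python
from decimal import Decimal

def evaluate_state(postings, tolerance=Decimal("0.005")):
    """Deterministic evaluator: 'sure' if the candidate transaction's
    postings balance to zero within tolerance, 'impossible' otherwise.
    Because the check is exact, it cannot prune a correct path the way
    a hallucinating LLM evaluator can."""
    residual = sum((amount for _account, amount in postings), Decimal("0"))
    return "sure" if abs(residual) <= tolerance else "impossible"

balanced = [("Assets:Checking", Decimal("-42.00")),
            ("Expenses:Groceries", Decimal("42.00"))]
unbalanced = [("Assets:Checking", Decimal("-42.00")),
              ("Income:Salary", Decimal("40.00"))]
```

Plugged into a ToT-style search, this would replace the sure/maybe/impossible LLM call for any state that is far enough along to be checked mechanically.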

Further reading

  • Graph of Thoughts: Solving Elaborate Problems with Large Language Models (Besta et al., AAAI 2024) — arXiv:2308.09687. Extends ToT from trees to arbitrary graphs, enabling thought merging and feedback loops. The cost reduction claim (>31%) is directly relevant if you want search-based reasoning without ToT's overhead.
  • Large Language Models Cannot Self-Correct Reasoning Yet (Huang et al., ICLR 2024) — arXiv:2310.01798. A critical counterpoint: without external feedback, intrinsic self-correction degrades reasoning performance. This challenges the assumption that ToT's LLM-based evaluator is reliable enough to guide search.
  • RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation (arXiv:2409.09584) — applies MCTS instead of BFS/DFS to thought search, with execution feedback as the external oracle. The code generation setting is structurally similar to ledger write-back: you have a verifiable ground truth (does the code run? does the balance check pass?), which is exactly where Monte Carlo rollouts add value.