
Tree of Thoughts: Deliberate Problem Solving with LLM Search

· 7 min read
Mike Thrift
Marketing Manager

After spending the last two entries on agents that self-correct via reflection (Reflexion) and tool-interactive critiquing (CRITIC), I wanted to step back and look at a more structural approach: what if the agent never commits to a single reasoning path in the first place? Tree of Thoughts (ToT) from Yao et al. (NeurIPS 2023) proposes exactly that — a search framework where the LLM explores a branching space of intermediate reasoning steps rather than one linear chain. I read it now because it represents the clearest formulation of deliberate search for LLM reasoning, and deliberate search is what you need when a single wrong intermediate step in a financial calculation can silently corrupt everything downstream.

The paper


Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan introduce Tree of Thoughts as a generalization of chain-of-thought prompting. The key move is to treat intermediate reasoning steps as "thoughts" — coherent text units that can be independently evaluated — and to organize them into a tree rather than a chain. At each node, the model generates multiple candidate thoughts, evaluates each one (via a separate LLM call that scores states as "sure / maybe / impossible"), and then applies a standard search algorithm (BFS or DFS) to traverse the tree. If a branch looks dead, the model can prune it or backtrack — something neither CoT nor CoT-SC can do.
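The generate–evaluate–prune loop can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `propose_thoughts` and `evaluate_state` are hypothetical stand-ins for the two LLM calls, and the toy run at the bottom searches for three numbers from {1, 2, 3} that sum to exactly 6.

```python
VALUES = {"sure": 1.0, "maybe": 0.5, "impossible": 0.0}

def tot_bfs(root, propose_thoughts, evaluate_state, beam_width=5, depth=3):
    """Breadth-first search over partial reasoning states.

    propose_thoughts(state) -> list of candidate next thoughts
    evaluate_state(state)   -> "sure" | "maybe" | "impossible"
    """
    frontier = [root]
    for _ in range(depth):
        # Expand every frontier state with every proposed thought.
        candidates = [state + [t] for state in frontier
                      for t in propose_thoughts(state)]
        # Keep the best-scoring states; "impossible" branches are pruned.
        scored = sorted(candidates,
                        key=lambda s: VALUES[evaluate_state(s)],
                        reverse=True)
        frontier = [s for s in scored
                    if evaluate_state(s) != "impossible"][:beam_width]
        if not frontier:
            return None  # every branch pruned -- nothing left to extend
    return frontier[0]

# Toy run: pick three numbers from {1, 2, 3} whose sum is exactly 6.
def propose(state):
    return [1, 2, 3]

def evaluate(state):
    total = sum(state)
    if total > 6:
        return "impossible"
    return "sure" if total == 6 else "maybe"

result = tot_bfs([], propose, evaluate, beam_width=5, depth=3)
```

With a real LLM behind the two callbacks, `state` would be the running chain of partial reasoning text and the evaluator prompt would ask for the same sure/maybe/impossible verdict the paper uses.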

The paper evaluates on three tasks: Game of 24 (combine four numbers with arithmetic operations to reach 24), Creative Writing (write a coherent four-paragraph passage in which each paragraph ends with one of four given random sentences), and Mini Crosswords (solve 5×5 crossword puzzles). All three require reasoning that benefits from exploration and backtracking, which is exactly the setting the authors designed for.

Key ideas

  • On Game of 24, ToT with beam width b=5 achieves 74% success, versus 4% for GPT-4 with standard CoT and 9% for CoT-SC with 100 samples. That gap is striking.
  • GPT-3.5 + ToT reaches only 19% on the same task; the method's benefit is highly model-dependent. The thought generation quality of GPT-4 is what drives most of the gain — GPT-4 generation + GPT-3.5 evaluation achieves 64%, while GPT-3.5 generation + GPT-4 evaluation achieves only 31%.
  • For Creative Writing, ToT scores 7.56 vs CoT's 6.93 on a GPT-4 coherency scale, and human annotators prefer ToT outputs 41/100 times versus 21/100 for CoT.
  • Mini Crosswords: ToT achieves 60% word-level accuracy (versus 15.6% for CoT and 14% for IO prompting) but solves only 4 of 20 full games (20%). The gap between word-level and game-level success reveals that even with backtracking, global constraint satisfaction remains hard.
  • The evaluation step is itself an LLM call. On crosswords, the paper notes that evaluators sometimes deem correct partial states "impossible" due to unfamiliar vocabulary — a compounding failure mode where the evaluator's mistakes poison the search.
  • Compute cost: ToT costs approximately $0.74 per case on Game of 24 versus $0.47 for best-of-100 CoT. The authors themselves flag that for tasks GPT-4 already handles well, the overhead is not worth it.

What holds up — and what doesn't

The core result — that tree search over intermediate thoughts massively outperforms sequential CoT on tasks requiring backtracking — is real and reproducible. The 74% vs 4% gap on Game of 24 is not noise. The explanation is mechanically sound: a single bad intermediate equation in CoT sends the rest of the chain off a cliff, while ToT can prune that branch and try a different decomposition.

What I find less convincing is the generalizability claim. All three evaluation tasks are relatively synthetic: a math puzzle, a creative writing prompt with structural constraints, and a word game. None of them resemble the open-ended, ambiguous problems that appear in production finance workflows. The authors also evaluate only on GPT-4 (and GPT-3.5 as an ablation), so we don't know how ToT performs with smaller or fine-tuned models — and the 19% figure for GPT-3.5 suggests the answer is "not well."

The game-level crossword failure (20% despite 60% word accuracy) points to a deeper issue: ToT is a local search guided by a local evaluator. It does not maintain a global constraint model, which is exactly what you need for problems where subsolution interactions are dense. The follow-on Graph of Thoughts paper (Besta et al., AAAI 2024) makes this criticism explicit and demonstrates a 62% quality improvement over ToT on sorting tasks while reducing cost by over 31% — by allowing thoughts to merge and form cycles rather than being confined to a tree.

Finally, the cost structure matters in practice. At b=5 with repeated evaluator calls, ToT is roughly 15–20× more expensive in API calls than a single CoT pass. For latency-sensitive or cost-sensitive applications, this is not trivially acceptable.

Why this matters for finance AI

The honest answer is: ToT matters most for a narrow slice of the Beancount problem space, but that slice is real.

The canonical finance task where I want backtracking is multi-step account classification with ambiguous transactions. When an LLM is mapping an imported bank statement to a chart of accounts, a single wrong assignment early in the chain (say, treating a loan disbursement as income) can cascade into a broken balance check several steps later. In a CoT agent, by the time the balance fails, the model has no mechanism to revisit the original classification. A ToT agent could backtrack to that node and try Liabilities:Loans instead.
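That backtracking move can be made concrete in a short sketch. Every name here is illustrative, not a real API: `classify_with_backtracking`, the candidate account list, and the stubbed `balance_ok` check all stand in for what a real agent would wire to its ledger.

```python
def classify_with_backtracking(amount, candidates, balance_ok):
    """Try candidate accounts in confidence order; return the first one
    that keeps the downstream balance check satisfied, instead of
    committing to the top-ranked guess the way a linear CoT agent would."""
    for account in candidates:
        if balance_ok(account, amount):
            return account
    return None  # nothing balances: surface for review, don't write back

# Ambiguous loan disbursement: the model's top guess is income.
candidates = ["Income:Misc", "Liabilities:Loans"]

# Stubbed check: booking the inflow as income breaks a later balance
# assertion, while booking it against the loan account balances.
def balance_ok(account, amount):
    return account == "Liabilities:Loans"

chosen = classify_with_backtracking(10_000, candidates, balance_ok)
```

The point of the sketch is the control flow: the wrong early assignment is revisited at the node where it was made, rather than detected (too late) when the chain's final balance check fails.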

Similarly, tax optimization across a full fiscal year is a genuine tree search problem: itemizing vs. taking the standard deduction, timing capital gains realizations, bunching charitable contributions. These decisions interact non-linearly, and you need to evaluate multiple branches before committing. ToT's BFS/DFS framing maps naturally onto that structure.

What ToT does not help with is the dominant case in Beancount: routine transaction entry and reconciliation. For a transaction that has a clear counterpart in the ledger, CoT + PAL (offload arithmetic to a code interpreter) is faster, cheaper, and already accurate enough. Bringing ToT to bear on an Expenses:Groceries classification is using a sledgehammer on a pushpin.
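The PAL pattern is trivial to sketch: the model emits a short program and the interpreter does the arithmetic. The `generated` string below is a hand-written stand-in for model output, not a real model call.

```python
# Hand-written stand-in for model-emitted code; in PAL the LLM writes this.
generated = """
amounts = [12.37, 48.10, 3.99]
total = round(sum(amounts), 2)
"""

namespace = {}
exec(generated, namespace)  # arithmetic done by the interpreter, not the LLM
total = namespace["total"]
```

The division of labor is the whole idea: the model handles the fuzzy mapping from statement text to amounts, and the deterministic interpreter handles the part where a single-digit slip would corrupt the ledger.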

The more pressing concern for write-back safety is the evaluator reliability problem. If the state evaluator is also an LLM, it can be wrong — and wrong evaluations don't just slow the search, they prune correct paths. Any production finance agent using ToT would need an external oracle (a balance check, a schema validator, a rule engine) to serve as the evaluator, not another LLM call.
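As a sketch of what such an external oracle could look like: a deterministic double-entry balance check that maps candidate postings straight to the paper's sure/impossible labels, with no LLM judgment in the loop. The posting format (account, amount) is an assumption for illustration, not Beancount's API.

```python
from decimal import Decimal

def evaluate_state(postings, tolerance=Decimal("0.005")):
    """Deterministic evaluator: 'sure' if the candidate transaction's
    postings balance to zero within tolerance, 'impossible' otherwise.
    Because the check is exact, it cannot prune a correct path the way
    a hallucinating LLM evaluator can."""
    residual = sum((amount for _account, amount in postings), Decimal("0"))
    return "sure" if abs(residual) <= tolerance else "impossible"

balanced = [("Assets:Checking", Decimal("-42.00")),
            ("Expenses:Groceries", Decimal("42.00"))]
unbalanced = [("Assets:Checking", Decimal("-42.00")),
              ("Income:Salary", Decimal("40.00"))]
```

Plugged into a ToT-style search, this would replace the sure/maybe/impossible LLM call for any state that is far enough along to be checked mechanically.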

Further reading

  • Graph of Thoughts: Solving Elaborate Problems with Large Language Models (Besta et al., AAAI 2024) — arXiv:2308.09687. Extends ToT from trees to arbitrary graphs, enabling thought merging and feedback loops. The cost reduction claim (>31%) is directly relevant if you want search-based reasoning without ToT's overhead.
  • Large Language Models Cannot Self-Correct Reasoning Yet (Huang et al., ICLR 2024) — arXiv:2310.01798. A critical counterpoint: without external feedback, intrinsic self-correction degrades reasoning performance. This challenges the assumption that ToT's LLM-based evaluator is reliable enough to guide search.
  • RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation (arXiv:2409.09584) — applies MCTS instead of BFS/DFS to thought search, with execution feedback as the external oracle. The code generation setting is structurally similar to ledger write-back: you have a verifiable ground truth (does the code run? does the balance check pass?), which is exactly where Monte Carlo rollouts add value.