ReDAct runs a small model by default and escalates to an expensive model only when token-level perplexity signals uncertainty, achieving 64% cost savings over GPT-5.2-only while matching or exceeding its accuracy — a directly applicable pattern for Beancount transaction-categorization agents.
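A minimal sketch of that perplexity-gated cascade, assuming each model is a callable returning its generated text plus per-token log-probabilities; the threshold value and helper names here are illustrative, not ReDAct's published interface:

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token log-probs: exp(-mean(log p))."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def cascade(prompt, small_model, large_model, threshold=4.0):
    """Run the cheap model first; escalate to the expensive model
    only when the cheap model's own token-level perplexity suggests
    it is uncertain. Both models are hypothetical callables that
    return (text, per_token_logprobs)."""
    text, logprobs = small_model(prompt)
    if perplexity(logprobs) <= threshold:
        return text, "small"  # confident: keep the cheap answer
    text, _ = large_model(prompt)
    return text, "large"      # uncertain: pay for the big model
```

For a Beancount categorization agent, `prompt` would carry the transaction payee/narration and `text` the predicted account name; most routine transactions stay on the small model.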
InvestorBench (ACL 2025) tests 13 LLM backbones on backtested stock, crypto, and ETF trading using cumulative return and Sharpe ratio — not QA accuracy. Qwen2.5-72B tops the stock leaderboard at 46.15% CR; finance-tuned models backfire on equities. Model size predicts performance more reliably than domain fine-tuning.
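Both leaderboard metrics are straightforward to compute from a backtest's per-period returns; a plain-Python sketch, assuming simple (not log) returns and a 252-trading-day annualization factor:

```python
import math

def cumulative_return(returns):
    """Cumulative return (CR) from a series of per-period simple returns."""
    total = 1.0
    for r in returns:
        total *= 1.0 + r
    return total - 1.0

def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio: mean excess return over its sample
    standard deviation, scaled by sqrt(periods per year)."""
    excess = [r - risk_free for r in returns]
    mean = sum(excess) / len(excess)
    var = sum((r - mean) ** 2 for r in excess) / (len(excess) - 1)
    return mean / math.sqrt(var) * math.sqrt(periods_per_year)
```
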
LATS (Language Agent Tree Search, ICML 2024) unifies ReAct, Tree of Thoughts, and Reflexion in a single MCTS framework, reaching 92.7% pass@1 on HumanEval with GPT-4. A Git-based Beancount ledger easily satisfies the state-restoration requirement that limits LATS in production.
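A sketch of the snapshot/restore hook that any tree search over ledger edits needs. The ledger here is an in-memory dict for demonstration; with a Git-backed Beancount file, `snapshot()` would plausibly record `git rev-parse HEAD` and `restore()` would run `git checkout <hash> -- ledger.beancount` — an assumed mapping, not LATS's published code:

```python
import copy

class LedgerSearchState:
    """Snapshot/restore support for LATS-style tree search over a ledger.

    Expanding a search branch mutates the ledger; backtracking to a
    sibling node requires restoring the exact prior state, which is
    what Git gives a Beancount file for free."""

    def __init__(self, ledger):
        self.ledger = ledger
        self._snapshots = {}
        self._next_id = 0

    def snapshot(self):
        """Record the current state; returns a handle (commit hash in Git)."""
        sid = self._next_id
        self._next_id += 1
        self._snapshots[sid] = copy.deepcopy(self.ledger)
        return sid

    def restore(self, sid):
        """Roll the ledger back to a previously recorded state."""
        self.ledger = copy.deepcopy(self._snapshots[sid])
```
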
Tree of Thoughts (ToT) achieves 74% on Game of 24 vs 4% for standard GPT-4 CoT by organizing LLM reasoning into a branching search tree with pruning and backtracking — with direct implications for multi-step financial classification and tax optimization in Beancount workflows.
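A toy version of the Game of 24 search tree makes the branching-plus-backtracking structure concrete. This sketch explores every child exhaustively with depth-first search; the actual ToT method instead uses LLM-proposed candidate steps and LLM value scores to prune most branches:

```python
from itertools import combinations

# Each tree node is a multiset of remaining values; a child node
# replaces two values with the result of one arithmetic operation.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b if b else None,  # skip division by zero
}

def solve24(nums, target=24.0, eps=1e-6):
    """DFS over the Game of 24 tree: dead branches return False and
    the search backtracks to try a sibling operation."""
    if len(nums) == 1:
        return abs(nums[0] - target) < eps
    for i, j in combinations(range(len(nums)), 2):
        rest = [n for k, n in enumerate(nums) if k not in (i, j)]
        for a, b in ((nums[i], nums[j]), (nums[j], nums[i])):
            for fn in OPS.values():
                v = fn(a, b)
                if v is not None and solve24(rest + [v], target, eps):
                    return True
    return False
```

The same skeleton carries over to multi-step financial classification: nodes are partial labelings of a transaction set, children are candidate next assignments, and a value heuristic prunes inconsistent branches.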